We introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model that generates high-quality novel-view images from incomplete inputs rendered from a coarse scene, and (2) a progressive expansion process that constructs a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building on it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel-view videos and flexible-view 3D scenes from single images, achieving superior visual quality on multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views, such as 360° rotations and zooming.
FlexWorld generates videos conditioned on point clouds and a given camera trajectory; each row shows the videos generated under the same trajectory.
FlexWorld trains a strong V2V diffusion model capable of generating high-quality videos from incomplete views rendered from coarse 3D scenes. It progressively expands the 3D scene by adding new 3D content estimated from the refined videos via a dense stereo model. Ultimately, from a single image, it yields a detailed 3D scene that supports rendering from flexible viewpoints.
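The loop can be summarized in a short sketch. The Python below is illustrative pseudocode under assumed names, not the released FlexWorld code: init_scene, render, refine, lift, and fuse are hypothetical stand-ins for the coarse scene initialization, incomplete-view rendering, the V2V diffusion model, the dense stereo model, and geometry-aware scene fusion, respectively.

from typing import Any, Callable, Sequence

def expand_scene(
    image: Any,
    trajectories: Sequence[Any],
    init_scene: Callable[[Any], Any],   # image -> coarse 3D scene (hypothetical)
    render: Callable[[Any, Any], Any],  # (scene, trajectory) -> incomplete video
    refine: Callable[[Any, Any], Any],  # V2V diffusion: incomplete -> refined video
    lift: Callable[[Any, Any], Any],    # dense stereo: refined video -> new 3D content
    fuse: Callable[[Any, Any], Any],    # geometry-aware fusion into the global scene
) -> Any:
    """Sketch of FlexWorld-style progressive expansion from a single image."""
    scene = init_scene(image)  # e.g., a point cloud lifted from estimated depth
    for trajectory in trajectories:
        coarse = render(scene, trajectory)    # unseen regions render as holes
        refined = refine(coarse, trajectory)  # V2V model completes and sharpens
        content = lift(refined, trajectory)   # estimate geometry for refined frames
        scene = fuse(scene, content)          # integrate new content into the scene
    return scene  # detailed scene supporting flexible-view rendering

Each iteration follows one camera trajectory, so flexible views such as 360° rotations emerge from chaining trajectories that progressively cover the scene.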
@misc{chen2025flexworldprogressivelyexpanding3d,
  title={FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis},
  author={Luxi Chen and Zihan Zhou and Min Zhao and Yikai Wang and Ge Zhang and Wenhao Huang and Hao Sun and Ji-Rong Wen and Chongxuan Li},
  year={2025},
  eprint={2503.13265},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.13265},
}