FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

1 Renmin University of China; 2 Beijing Key Laboratory of Big Data Management and Analysis Methods; 3 Tsinghua University; 4 Beijing Normal University; 5 ByteDance
*Equal Contribution Corresponding author

FlexWorld generates high-quality videos with camera control and flexible-view 3D scenes progressively.

Abstract

We introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming.

Video Results with Given Trajectory

FlexWorld generates videos from point clouds and a given camera trajectory. Each row shows videos generated under the same trajectory.

Row 1: cabin, car, castle, desert.
Row 2: cabin, bear, castle, desert.
Row 3: cabin, car, castle, desert.

More Video Results

Scene Generation Results

Methods

FlexWorld trains a strong V2V diffusion model capable of generating high-quality videos from incomplete views rendered from coarse 3D scenes. It progressively expands the 3D scene by adding new 3D content, which a dense stereo model estimates from the refined videos. Ultimately, starting from a single image, it yields a detailed 3D scene that can be rendered from flexible viewpoints.

Figure: FlexWorld pipeline.
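The progressive expansion described above can be read as a loop over camera trajectories: render an incomplete video from the current scene, refine it with the V2V model, estimate new 3D content from the refined video, and fuse it into the scene. The following is a minimal sketch of that control flow only, not the authors' implementation. Every name in it (`expand_scene`, `render`, `refine_v2v`, `estimate_points`, `fuse`, and the `toy_*` stand-ins) is hypothetical, and the scene is simplified to a plain list of 3D points.

```python
from typing import Callable, List, Sequence, Tuple

# Simplified stand-ins (assumptions, not the FlexWorld data structures).
Point = Tuple[float, float, float]
Scene = List[Point]                 # the 3D scene, reduced to a list of points
Pose = Tuple[float, float, float]   # a camera pose, reduced to a position
Trajectory = Sequence[Pose]
Video = List[object]                # one entry per rendered frame


def expand_scene(
    initial_scene: Scene,
    trajectories: Sequence[Trajectory],
    render: Callable[[Scene, Trajectory], Video],
    refine_v2v: Callable[[Video], Video],
    estimate_points: Callable[[Video, Trajectory], Scene],
    fuse: Callable[[Scene, Scene], Scene],
) -> Scene:
    """Grow a scene one trajectory at a time (hypothetical control flow)."""
    scene = list(initial_scene)
    for trajectory in trajectories:
        coarse_video = render(scene, trajectory)      # incomplete views of the coarse scene
        refined_video = refine_v2v(coarse_video)      # V2V model fills in missing content
        new_points = estimate_points(refined_video, trajectory)  # e.g. dense stereo
        scene = fuse(scene, new_points)               # integrate into the global scene
    return scene


# Toy stand-ins so the loop can be executed; they do no real rendering,
# diffusion, stereo estimation, or geometry-aware fusion.
def toy_render(scene: Scene, trajectory: Trajectory) -> Video:
    return [None for _ in trajectory]                 # one blank frame per pose


def toy_refine(video: Video) -> Video:
    return video                                      # identity "refinement"


def toy_estimate(video: Video, trajectory: Trajectory) -> Scene:
    return [tuple(pose) for pose in trajectory]       # one point per frame


def toy_fuse(scene: Scene, new_points: Scene) -> Scene:
    return scene + [p for p in new_points if p not in scene]  # drop exact duplicates


trajectory_a = [(float(i), 0.0, 0.0) for i in range(5)]
trajectory_b = [(0.0, float(i), 0.0) for i in range(5)]
final_scene = expand_scene(
    [(0.0, 0.0, 0.0)], [trajectory_a, trajectory_b],
    toy_render, toy_refine, toy_estimate, toy_fuse,
)
print(len(final_scene))  # 9
```

Passing the four stages in as callables keeps the loop independent of any particular renderer, video model, or depth estimator. In a real system the fusion step would need to be geometry-aware, as the text states, rather than the exact-duplicate filter used here.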

BibTeX

@misc{chen2025flexworldprogressivelyexpanding3d,
  title={FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis},
  author={Luxi Chen and Zihan Zhou and Min Zhao and Yikai Wang and Ge Zhang and Wenhao Huang and Hao Sun and Ji-Rong Wen and Chongxuan Li},
  year={2025},
  eprint={2503.13265},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.13265},
}