DiffusionGS: Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

1 Johns Hopkins University   2 Adobe Research   3 HKUST  

A single-stage, 3DGS-based diffusion model that generates objects and scenes from a single view in ~6 seconds.


Generation results of our method. For objects, the prompt views are shown in the dashed box on the left; the generated novel views and Gaussian point clouds are on the right. For scenes, our model handles hard cases with occlusion and rotation, as shown in the dashed boxes in the third row. In the text-to-3D demos, the prompt images are generated by Stable Diffusion and Sora for objects and scenes, respectively.

Method Overview

Existing feed-forward image-to-3D methods mainly rely on 2D multi-view diffusion models, which cannot guarantee 3D consistency. These methods tend to collapse when the prompt view direction changes, and they mainly handle object-centric prompt images. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object and scene generation from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency, which allows the model to generate robustly from prompt views of any direction, beyond object-centric inputs. In addition, to improve the capability and generalization of DiffusionGS, we scale up the 3D training data with a scene-object mixed training strategy. Experiments show that DiffusionGS achieves better generation quality (2.20 dB higher PSNR and 23.25 lower FID) and over 5x faster speed (~6s on an A100 GPU) than SOTA methods. The text-to-3D demo also shows the practical value of our method.
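To make the reported PSNR gain concrete, here is a minimal sketch of the standard PSNR definition, assuming images stored as float arrays in [0, 1]. The function names `psnr` and `mse_ratio_for_gain` are illustrative and are not taken from the DiffusionGS code; the exact evaluation protocol (resolution, views, data range) is not specified on this page.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Standard peak signal-to-noise ratio in dB.

    Assumes `pred` and `target` have the same shape and that
    pixel values lie in [0, max_val] (an assumption of this sketch).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def mse_ratio_for_gain(gain_db):
    """MSE ratio (new / old) implied by a PSNR gain of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, a 2.20 dB gain corresponds to `mse_ratio_for_gain(2.20)`, i.e. a mean squared error of roughly 0.60 times the baseline's, if both methods are scored on the same images.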

The Overall Framework of Our DiffusionGS Pipeline. (a) When selecting the data for our scene-object mixed training, we impose two angle constraints on the positions and orientations of the viewpoint vectors to guarantee the convergence of the training process. (b) The denoiser of DiffusionGS in a single timestep, which directly outputs pixel-aligned 3D Gaussian point clouds.
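The two angle constraints in (a) can be illustrated with a small view-selection filter. This is only a sketch under stated assumptions, not the authors' implementation: it assumes 4x4 camera-to-world matrices, a camera forward axis along the +z column, positions measured from the scene origin, and threshold values (`max_pos_deg`, `max_dir_deg`) that are hypothetical placeholders rather than the values used for training.

```python
import numpy as np


def angle_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.dot(u, v) / (nu * nv), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))


def satisfies_angle_constraints(c2w_ref, c2w_cand,
                                max_pos_deg=45.0, max_dir_deg=60.0):
    """Check a candidate view against a reference (prompt) view.

    Constraint 1: angle between the two viewpoint position vectors.
    Constraint 2: angle between the two viewing (forward) directions.
    Thresholds are hypothetical; the +z forward axis is an assumption.
    """
    c2w_ref = np.asarray(c2w_ref, dtype=float)
    c2w_cand = np.asarray(c2w_cand, dtype=float)
    pos_ok = angle_deg(c2w_ref[:3, 3], c2w_cand[:3, 3]) <= max_pos_deg
    dir_ok = angle_deg(c2w_ref[:3, 2], c2w_cand[:3, 2]) <= max_dir_deg
    return pos_ok and dir_ok
```

A data loader could call `satisfies_angle_constraints` on each candidate view and keep only those that pass, so that the sampled views stay within a bounded angular range of the prompt view.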

Object-level Generation Results

ABO Hard Cases

GSO Hard Cases

Open Illumination (Real Camera)

Text-to-Image (Prompted by Stable Diffusion)

Text-to-Image (Prompted by FLUX)

Scene-level Generation Results

(The first frame is the prompt view; the other frames are generated by DiffusionGS.)

Indoor Scene Generation

Outdoor Scene Generation

Text-to-Image (The first frame is prompted by Sora)
