PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

1 Meta Superintelligence Labs   2 Johns Hopkins University   3 Meta BizAI   4 CUHK  

A data construction pipeline and a new DPO framework for physically consistent text-to-video generation


Data Construction Pipeline

Our physics-augmented video data construction pipeline first uses a VLM, Qwen-2.5-72B-Instruct, which follows our chain-of-thought rule in (b), to select text-video pairs rich in physical interactions and phenomena from the large-scale, high-quality text-video data pool in (a). Then, in (d), we cluster the filtered pairs from (c) by action, using semantic matching with a sentence Transformer. Finally, in (e), a physics-aware VLM evaluates the difficulty of each action category, and we sample text-video pairs accordingly as the winning cases of the training data.
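The difficulty-based sampling in step (e) can be illustrated with a small sketch. The caption does not say how difficulty scores map to sample counts, so the proportional rule below (harder categories receive more samples) is an assumption, and the names `allocate_budget`, `sample_pairs`, and the toy categories are hypothetical.

```python
import random


def allocate_budget(difficulty, total):
    """Split `total` samples across categories in proportion to difficulty.

    Assumption: a higher difficulty score means more samples. The page
    does not state the actual rule.
    """
    weight_sum = sum(difficulty.values())
    raw = {cat: total * d / weight_sum for cat, d in difficulty.items()}
    alloc = {cat: int(r) for cat, r in raw.items()}
    # Largest-remainder rounding so the allocations sum exactly to `total`.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda c: raw[c] - alloc[c], reverse=True)
    for cat in by_remainder[:leftover]:
        alloc[cat] += 1
    return alloc


def sample_pairs(pairs_by_category, difficulty, total, seed=0):
    """Draw text-video pairs per category according to the allocation."""
    rng = random.Random(seed)
    alloc = allocate_budget(difficulty, total)
    chosen = []
    for cat, n in alloc.items():
        pool = pairs_by_category[cat]
        # Never request more items than the category actually holds.
        chosen.extend(rng.sample(pool, min(n, len(pool))))
    return chosen


# Toy data: two hypothetical action categories with made-up difficulty scores.
difficulty = {"dunk": 3.0, "walk": 1.0}
pairs = {
    "dunk": [("dunk", i) for i in range(10)],
    "walk": [("walk", i) for i in range(10)],
}
selected = sample_pairs(pairs, difficulty, total=8)
```

In this toy setup the harder category receives three quarters of the budget. A real pipeline would take the scores from the physics-aware VLM and the categories from the clustering step.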



Method Overview

Overview of our PhyGDPO. The text prompts in (a), without physics-reasoning extension, are fed into the T2V model. In (b), we propose a LoRA-switch reference scheme to save GPU memory and improve training stability. In (c), PhyGDPO builds on the groupwise Plackett-Luce (PL) probabilistic model and uses a physics-aware VLM to provide rewards for DPO training.
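For readers unfamiliar with the Plackett-Luce model mentioned in (c), the sketch below computes the standard PL log-likelihood of a ranked group of scores. It is not the PhyGDPO objective, which this page does not spell out. The DPO-style implicit reward `beta * (logp_policy - logp_ref)` and all function names are assumptions for illustration, and how the physics-aware VLM reward enters the loss is left out.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def implicit_reward(logp_policy, logp_ref, beta=1.0):
    """DPO-style implicit reward (assumed form, not taken from the page)."""
    return beta * (logp_policy - logp_ref)


def pl_log_likelihood(scores):
    """Plackett-Luce log-likelihood of a ranking.

    `scores` are listed in ranked order, best first. At each position the
    item is chosen from those not yet ranked, with probability given by a
    softmax over the remaining scores.
    """
    total = 0.0
    # The last item is chosen with probability 1, so it contributes 0.
    for i in range(len(scores) - 1):
        total += scores[i] - logsumexp(scores[i:])
    return total


def groupwise_loss(scores):
    """Negative PL log-likelihood for one group (winner listed first)."""
    return -pl_log_likelihood(scores)


# Toy group: one winning sample followed by two losing samples.
# The log-probabilities are made-up numbers.
group = [
    implicit_reward(-10.0, -11.0),  # winner
    implicit_reward(-12.0, -11.5),  # loser 1
    implicit_reward(-13.0, -12.0),  # loser 2
]
loss = groupwise_loss(group)
```

With only two items, the PL likelihood reduces to the Bradley-Terry form used in pairwise DPO, which is one way to see the groupwise model as a generalization of the pairwise one.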



(a) Main Comparison with State-of-the-Art Methods

(a1) A player rides their horse, preparing to strike the ball, their mallet poised.

(Hint: Previous methods all fail to generate the striking interaction between the mallet and the ball.)

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a2) A soccer player runs, plants their foot, and drop kicks a soccer ball high into the air, the ball arcing visibly.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a3) A gymnast drops from the parallel bars and lands safely on the mat below.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a4) A gymnast performs a transition from a front support to a back support on the pommel horse.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a5) A player dunks the basketball, the basketball soaring upward before slamming through the net.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a6) A person wearing a helmet performs a handspring over a platform.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a7) A person plays squash on an indoor court.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a8) A gymnast stands on a balance beam, then performs a forward roll, landing smoothly.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a9) A wooden pencil is carefully dipped into a glass of crystal-clear water, showing the intriguing visual shifts and reflections caused by the interaction between the pencil and the liquid.

(Hint: The curved cup wall together with the water forms a convex-lens-like structure that produces a magnifying effect, and the water surface also introduces refraction. However, previous methods all fail in generating such complex physical phenomena.)

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a10) A small burning ball of paper was thrown into a pile of dry paper.

(Hint: In Sora2’s generated video, the burning ball fails to ignite the surrounding dry paper. In Veo3.1’s generated video, the burning ball explodes with unrealistic violence, producing scattered sparks.)

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a11) A golfer addresses the ball, takes a backswing, and hits the ball, sending it arcing high into the air.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a12) A metal pipe smashes a pumpkin, causing its insides to spill out.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a13) A baseball bat smashes a glass bottle, sending shards flying in all directions.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a14) A timelapse captures the transformation as fog in a mountainous region comes into contact with a cold car windshield.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a15) A volleyball player takes a hard swing, the ball contacting the palm and fingers, followed by a rapid downward motion sending the ball over the net.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)


(a16) A surfer dives underwater to avoid a breaking wave, resurfacing seconds later.

PhyT2V (CVPR 2025)
VideoDPO (CVPR 2025)
Wan2.1-T2V-14B
OpenAI Sora2
Google DeepMind Veo3.1
Wan2.1-T2V-14B + PhyGDPO (Ours)



(b) Comparison with State-of-the-Art DPO Frameworks

A weightlifter completes a snatch with a 25kg barbell, holding it momentarily overhead.

Wan2.1-T2V-1.3B
Wan2.1-T2V-1.3B + VideoDPO (CVPR 2025)
Wan2.1-T2V-1.3B + Flow-DPO (NeurIPS 2025)
Wan2.1-T2V-1.3B + PhyGDPO (Ours)


(c) Break-down Ablation Study of PhyGDPO

A tennis ball is gently placed on the surface of a bucket filled with water.

Wan
Wan + LoRA-SR
Wan + LoRA-SR + Groupwise Model
Wan + LoRA-SR + Groupwise Model + Physics Rewarding


(d) Ablation Study of LoRA-SR

A person shovels snow from a garden path, leaving a clear path.

Wan2.1-T2V-1.3B
Wan2.1-T2V-1.3B + LoRA-SFT
Wan2.1-T2V-1.3B + PhyGDPO w/o LoRA-SR
Wan2.1-T2V-1.3B + PhyGDPO with LoRA-SR

