Our physics-augmented video data construction pipeline. A VLM, Qwen-2.5-72B-Instruct, first follows our designed chain-of-thought rule in (b) to select text-video pairs rich in physical interactions and phenomena from the large-scale, high-quality text-video data pool in (a). Then, in (d), we cluster the filtered pairs from (c) by action, matching their semantics with a sentence Transformer. Finally, in (e), a physics-aware VLM evaluates the difficulty of each action category, and we sample text-video pairs accordingly as the winning cases of the training data.
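The difficulty-aware sampling step in (e) can be illustrated with a minimal sketch. The caption does not state the exact sampling rule, so the proportional allocation below is an assumption, and the names `sample_by_difficulty`, `difficulty`, and `budget` are hypothetical. The difficulty scores stand in for the output of the physics-aware VLM.

```python
import random
from collections import defaultdict


def sample_by_difficulty(pairs, difficulty, budget, seed=0):
    """Sample text-video pairs so that harder action categories get more samples.

    pairs: list of (category, item) tuples, e.g. the clustered output of (d).
    difficulty: dict mapping category -> positive difficulty score (assumed
        to come from a physics-aware VLM; hypothetical values here).
    budget: approximate total number of pairs to draw.
    """
    rng = random.Random(seed)

    # Group items by their action category.
    by_category = defaultdict(list)
    for category, item in pairs:
        by_category[category].append(item)

    total = sum(difficulty[c] for c in by_category)
    selected = []
    for category, items in by_category.items():
        # Assumed rule: quota proportional to difficulty, capped by availability.
        quota = min(len(items), round(budget * difficulty[category] / total))
        selected.extend((category, it) for it in rng.sample(items, quota))
    return selected


if __name__ == "__main__":
    demo = [("easy", i) for i in range(10)] + [("hard", i) for i in range(10)]
    picked = sample_by_difficulty(demo, {"easy": 1.0, "hard": 3.0}, budget=8)
    print(picked)
```

With these toy scores, the "hard" category receives three times the quota of the "easy" one. Other allocation rules (for example, temperature-scaled weights) would fit the same interface.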
Overview of our PhyGDPO. The text prompts in (a), without physics reasoning extension, are fed into the T2V model. In (b), we propose a LoRA-switch reference scheme to save GPU memory and improve training stability. In (c), PhyGDPO builds on the groupwise Plackett-Luce (PL) probabilistic model and uses a physics-aware VLM to provide rewards for DPO training.
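To make the Plackett-Luce component concrete, the sketch below computes a generic PL negative log-likelihood over a ranked group of samples, using DPO-style implicit rewards. This is the standard PL ranking objective, not necessarily the exact PhyGDPO loss: the function name `pl_dpo_loss`, the `beta` parameter, and the way physics-aware VLM rewards would enter the objective are assumptions not given in the caption.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def pl_dpo_loss(logp_policy, logp_ref, beta=0.1):
    """Plackett-Luce negative log-likelihood over one ranked group.

    logp_policy, logp_ref: log-likelihoods of the same samples under the
        trained model and the frozen reference, ordered from most to least
        preferred (the ordering is assumed to be given).
    beta: scale of the implicit reward, as in standard DPO (assumed value).
    """
    # DPO-style implicit reward for each sample in the group.
    rewards = [beta * (p - r) for p, r in zip(logp_policy, logp_ref)]

    nll = 0.0
    # At each rank k, the k-th sample must win against all remaining samples.
    # The final term is always zero, so it is skipped.
    for k in range(len(rewards) - 1):
        nll -= rewards[k] - logsumexp(rewards[k:])
    return nll


if __name__ == "__main__":
    print(pl_dpo_loss([-1.0, -2.0, -3.0], [-1.5, -1.5, -1.5], beta=1.0))
```

For a group of two, this reduces to the familiar pairwise DPO loss, the negative log-sigmoid of the reward difference, which is a useful sanity check when extending to larger groups.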
(a1) A player rides their horse, preparing to strike the ball, their mallet poised.
(Hint: Previous methods all fail to generate the striking interaction between the mallet and the ball.)
(a2) A soccer player runs, plants their foot, and drop kicks a soccer ball high into the air, the ball arcing visibly.
(a3) A gymnast drops from the parallel bars and lands safely on the mat below.
(a4) A gymnast performs a transition from a front support to a back support on the pommel horse.
(a5) A player dunks the basketball, the basketball soaring upward before slamming through the net.
(a6) A person wearing a helmet performs a handspring over a platform.
(a7) A person plays squash on an indoor court.
(a8) A gymnast stands on a balance beam, then performs a forward roll, landing smoothly.
(a9) A wooden pencil is carefully dipped into a glass of crystal-clear water, showing the intriguing visual shifts and reflections caused by the interaction between the pencil and the liquid.
(Hint: The curved cup wall together with the water forms a convex-lens-like structure that produces a magnifying effect, and the water surface also introduces refraction. However, previous methods all fail in generating such complex physical phenomena.)
(a10) A small burning ball of paper was thrown into a pile of dry paper.
(Hint: In Sora2’s generated videos, the burning ball fails to ignite the surrounding dry paper. In Veo3.1’s generated video, the burning ball exhibits an unrealistic violent explosion, producing scattered sparks.)
(a11) A golfer addresses the ball, takes a backswing, and hits the ball, sending it arcing high into the air.
(a12) A metal pipe smashes a pumpkin, causing its insides to spill out.
(a13) A baseball bat smashes a glass bottle, sending shards flying in all directions.
(a14) A timelapse captures the transformation as fog in a mountainous region comes into contact with a cold car windshield.
(a15) A volleyball player takes a hard swing, the ball contacting the palm and fingers, followed by a rapid downward motion sending the ball over the net.
(a16) A surfer dives underwater to avoid a breaking wave, resurfacing seconds later.
A weightlifter completes a snatch with a 25kg barbell, holding it momentarily overhead.
A tennis ball is gently placed on the surface of a bucket filled with water.
A person shovels snow from a garden path, leaving a clear path.