OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

1 Johns Hopkins University   2 Adobe   3 HKU   4 CUHK   5 Shanghai Jiao Tong University  

A data construction pipeline and a DiT method for controllable subject-driven video customization


Method Overview

OmniVCus can compose different input signals to customize a video. We design two embedding mechanisms: (a) Lottery Embedding (LE), which enables customization with more subjects at inference than were used in training by activating more frame embeddings with the training subjects; and (b) Temporally Aligned Embedding (TAE), which extracts guidance from control signals by aligning the frame embeddings of the condition tokens with those of the noise tokens.
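The two embedding ideas above can be illustrated with a minimal sketch. It is only one reading of the overview, not the released implementation: the function names, the slot pool size, and the choice to place video frames after the subject slots are all hypothetical. The sketch assumes LE draws the frame-embedding slots for the subject images at random from a pool larger than the number of training subjects, and that TAE gives each control token the same frame index as the noise token of the same frame.

```python
import random


def lottery_frame_slots(num_subjects, pool_size, rng=None):
    """Draw distinct frame-embedding slots for the subject images.

    Assumption: pool_size exceeds the number of subjects seen in training,
    so that over many training steps every slot in the pool is used.
    """
    if num_subjects > pool_size:
        raise ValueError("pool_size must be at least num_subjects")
    rng = rng or random.Random()
    # Sample without replacement, then sort so subject order is preserved.
    return sorted(rng.sample(range(pool_size), num_subjects))


def aligned_frame_indices(num_frames, offset):
    """Return frame indices for noise tokens and control tokens.

    Assumption: video frames start at `offset`, after the subject slots,
    and control tokens (depth, mask) reuse the noise tokens' indices.
    """
    noise = [offset + t for t in range(num_frames)]
    control = list(noise)  # same indices: temporally aligned
    return noise, control


# Example: 2 training subjects drawn from a hypothetical pool of 4 slots.
subject_slots = lottery_frame_slots(2, 4, random.Random(0))
noise_idx, control_idx = aligned_frame_indices(num_frames=3, offset=4)
```

Under these assumptions, inference with 3 or 4 subjects would simply occupy more slots of the same pool, all of which received training signal.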





(a) Main Subject-driven Video Customization Results

(a1) Single-subject Video Customization

Thumbnail

The woman in IMG1 is talking to a man on a street

Thumbnail

A giraffe from IMG1 is standing in the winter


Thumbnail

A dog is playing with a yellow toy car of IMG1 in a living room

Thumbnail

A girl putting on a pair of glasses that turn into the effect of IMG1


Thumbnail

A boy of IMG1 praying in a temple

Thumbnail
Thumbnail

A woman of IMG1 mopping the floor in a living room




(a2) Instructive Editing Subject-driven Video Customization

(The instructive editing text is shown in purple.)
Thumbnail

(Style Transfer)

Make the stone cross as IMG1 on a stone pavement a sketch

Thumbnail

(Color Change)

Turn the color of the toy car in IMG1 to red playing with a dog


Thumbnail

(Removal)

Remove the blazer of the man in IMG1 standing in a summer park

Thumbnail

(Replacement)

Swap the T-shirt of the boy in IMG1 with a blue shirt praying in a temple


Thumbnail

(Addition)

Add a necklace to the woman in IMG1 running in the winter

Thumbnail

(Expression Change)

Make the child in IMG1 happy doing her homework




(a3) Double-subject Video Customization

Thumbnail 1
IMG1
Thumbnail 2
IMG2

A man of IMG1 running in front of the church of IMG2

Thumbnail 1
IMG1
Thumbnail 2
IMG2

A man pushing a wheelbarrow as IMG1 with oranges as IMG2 in an orchard


Thumbnail 1
IMG1
Thumbnail 2
IMG2

Turn the color of the sweater of the man in IMG1 to gray drinking a cup of coffee as IMG2

Thumbnail 1
IMG1
Thumbnail 2
IMG2

Swap the skirt of the woman as IMG1 to a gray legging and a woman as IMG2 running in a park


Thumbnail 1
IMG1
Thumbnail 2
IMG2

A man using a knife as IMG1 to cut an orange as IMG2 in the kitchen

Thumbnail 1
IMG1
Thumbnail 2
IMG2

Tomatoes as IMG1 and sushi as IMG2 on a table


Thumbnail 1
IMG1
Thumbnail 2
IMG2

The man as IMG1 using a laptop as IMG2 in a library

Thumbnail 1
IMG1
Thumbnail 2
IMG2

Add a black coat to the man as IMG1 using a laptop as IMG2 in a library


Thumbnail 1
IMG1
Thumbnail 2
IMG2

A dog as IMG1 playing with the yellow toy car as IMG2 in a living room

Thumbnail 1
IMG1
Thumbnail 2
IMG2

A dog as IMG1 playing with the yellow toy car as IMG2 in a plaza




(a4) Zero-shot More-subject Video Customization

(Our model is trained with 2 subjects but can run inference with 3 or 4 subjects.)
Thumbnail 1
IMG1
Thumbnail 2
IMG2
Thumbnail 3
IMG3

A man pushing a wheelbarrow as IMG1 with oranges as IMG2 and grapes as IMG3 in an orchard

Thumbnail 1
IMG1
Thumbnail 2
IMG2
Thumbnail 3
IMG3
Thumbnail 4
IMG4

Turn the color of the T-shirt of the girl as IMG1 to light gray using a laptop as IMG2 at a table with a cup as IMG3 and a tomato as IMG4


Thumbnail 1
IMG1
Thumbnail 2
IMG2
Thumbnail 3
IMG3
Thumbnail 4
IMG4

A dog of IMG1 running by a stone cross of IMG3 and purple flowers of IMG4 in front of the church of IMG2

Thumbnail 1
IMG1
Thumbnail 2
IMG2
Thumbnail 3
IMG3
Thumbnail 4
IMG4

A sponge as IMG2, a cucumber in IMG3, and an orange as IMG4 in the sink as IMG1 with water flowing





(a5) Camera-controlled Subject-driven Video Customization

Thumbnail

A church as IMG1 beside a cemetery in the winter

Thumbnail

A woman in a mask as IMG1 holding a book and sitting on a couch in a living room

Thumbnail

Thumbnail

A bridge like IMG1 by the ocean

Thumbnail

Tie the hair of the child as IMG1 doing her homework at the desk




(a6) Depth-controlled Subject-driven Video Customization

(Our model can fill the depth control sequence with subjects that do not match it, e.g., example 1 is man-to-woman, example 2 is woman-to-man, and example 4 is orange-to-strawberry. These mismatched cases are more challenging.)
Thumbnail 1
IMG1
Input Depth

A man as IMG1 standing in front of an old airplane

Thumbnail 1
IMG1
Input Depth

Replace the shirt of the woman as IMG1 with a suit talking to her colleagues in an office


Thumbnail 1
IMG1
Input Depth

A boy as IMG1 looking into an open refrigerator with a tomato and a bottle of beer

Thumbnail 1
IMG1
Input Depth

An orange from IMG1 being dipped into melted chocolate




(a7) Mask-controlled Subject-driven Video Customization

(1) When the control mask is incomplete or inaccurate, our model can still fill in the subject and customize the video, e.g., the T-shirt of the boy in example 1.
(2) Our model can fill the mask control sequence with subjects that do not match it, e.g., example 3 fills a dog into the mask of a leopard and example 4 fills a church into the mask of a stone house.
Thumbnail 1
IMG1
Input Mask

A boy as IMG1 playing table tennis in a room

Thumbnail 1
IMG1
Input Mask

The woman as IMG1 doing yoga in a living room


Thumbnail 1
IMG1
Input Mask

A dog as IMG1 is walking in a living room

Thumbnail 1
IMG1
Input Mask

A church as IMG1 on a snowy day




(b) Challenging Cases: Composing Different Control Conditions

Our model can flexibly compose different control conditions, including subject images, text instructions, depth sequences, mask sequences, and camera trajectories, to customize the video.
Thumbnail 1
Multiple Subjects
Input Mask

Swap the skirt of the woman as IMG1 with the legging as IMG2 doing yoga in a living room

Thumbnail
Multiple Subjects + Camera

A woman as IMG2 standing by the flower as IMG1 on a street


Thumbnail 1
Multiple Subjects
Input Depth

Replace the clothes of the woman as IMG1 with the T-shirt as IMG2 looking into an open refrigerator with a tomato and a cucumber as IMG3

Thumbnail 1
Multiple Subjects
Input Depth

A woman as IMG1 looking into an open refrigerator with an orange as IMG2 and a cucumber as IMG3




(c) Model Emergence Capability: Text-to-4D

Trained on a mix of customization data (dynamic) and text-to-3D data (static), our model can perform text-to-4D generation. Note that 4D here refers to dynamic multi-view generation; our model does not directly contain a 4D representation.

A baby dragon hatching out of a stone egg

A building on fire

A bulldozer clearing away a pile of snow



A space shuttle launching

A clown fish swimming through the coral reef

A Darth Vader with a flame thrower



An elephant jumping on a trampoline

A wizard raccoon casting a spell

A steaming basket of buns


(d) Comparison Example with State-of-the-Art Methods

Thumbnail

The woman in IMG1 is talking to a man on a street

Our Method

Thumbnail

The woman in IMG1 is talking to a man on a street

Thumbnail

SkyReels-A2

Thumbnail

The woman in IMG1 is talking to a man on a street

Thumbnail

OmniGen + Wan2.1-I2V
