Neural Video Fields Editing



1 SECE, Peking University     2 Peng Cheng Laboratory

TL;DR: Editing long videos coherently via neural video fields.

Turn it into a watercolor painting

Turn it into a Monet style painting

Turn him into Obama

Turn the wolf into a brown bear

Turn him into the Hulk

Turn it into Cartoon style

Turn it into Van-Gogh style

Turn the dog into a tiger

Give him a mustache

Make it autumn

Have him ride a donkey

Put him in a black suit

Turn it into a ginger cat

Turn him into Martin Luther King

Abstract

Diffusion models have revolutionized text-driven video editing. However, applying these methods to real-world videos faces two significant challenges: (1) graphics memory demand grows rapidly with the number of frames, and (2) edited videos suffer from inter-frame inconsistency. To this end, we propose NVEdit, a novel text-driven video editing framework designed to mitigate memory overhead and improve inter-frame consistency for real-world long videos. Specifically, we construct a neural video field, powered by a tri-plane and a sparse grid, to encode videos with hundreds of frames in a memory-efficient manner. Next, we update the video field through off-the-shelf Text-to-Image (T2I) models to impart text-driven editing effects. A progressive optimization strategy is developed to preserve the original temporal priors. Importantly, both the neural video field and the T2I model are adaptable and replaceable, inspiring future research. Experiments demonstrate that our approach successfully edits hundreds of frames with impressive inter-frame consistency.
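
To make the encoding concrete, below is a minimal PyTorch sketch of a tri-plane plus grid video field. It illustrates the general idea rather than the paper's implementation: the plane and grid resolutions, feature dimension, and decoder are placeholder choices, and a dense grid stands in for the sparse grid mentioned above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NeuralVideoField(nn.Module):
        # Maps a normalized (x, y, t) coordinate to an RGB value. Three 2D
        # feature planes factorize the video volume over (x, y), (x, t), and
        # (y, t); a coarse 3D grid adds volumetric detail; a small MLP decodes
        # the fused features to color. All sizes here are illustrative.
        def __init__(self, plane_res=256, grid_res=64, feat_dim=16):
            super().__init__()
            self.planes = nn.ParameterList([
                nn.Parameter(0.1 * torch.randn(1, feat_dim, plane_res, plane_res))
                for _ in range(3)  # (x, y), (x, t), (y, t)
            ])
            self.grid = nn.Parameter(
                0.1 * torch.randn(1, feat_dim, grid_res, grid_res, grid_res))
            self.decoder = nn.Sequential(
                nn.Linear(4 * feat_dim, 64), nn.ReLU(inplace=True),
                nn.Linear(64, 3), nn.Sigmoid())  # RGB in [0, 1]

        def forward(self, coords):
            # coords: (N, 3) tensor of (x, y, t), each in [-1, 1].
            x, y, t = coords.unbind(-1)
            feats = []
            for plane, uv in zip(self.planes, ((x, y), (x, t), (y, t))):
                loc = torch.stack(uv, dim=-1).view(1, -1, 1, 2)
                f = F.grid_sample(plane, loc, align_corners=True)  # (1, C, N, 1)
                feats.append(f.view(plane.shape[1], -1).t())
            loc3 = coords.view(1, -1, 1, 1, 3)
            g = F.grid_sample(self.grid, loc3, align_corners=True)  # (1, C, N, 1, 1)
            feats.append(g.view(self.grid.shape[1], -1).t())
            return self.decoder(torch.cat(feats, dim=-1))

Rendering a frame then amounts to querying the field on a full (x, y) lattice at a fixed t.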

Pipeline

Left, Video Fitting Stage: we train a neural video field to fit a given video, capturing its temporal priors.
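
As a rough illustration of this stage, the sketch below regresses such a field (the NeuralVideoField sketch above) onto the frames of a video with a plain MSE reconstruction loss; the random sampling scheme and all hyperparameters are assumptions for illustration, not the paper's settings.

    import torch
    import torch.nn.functional as F

    def fit_video(field, video, iters=10000, lr=1e-2, batch=65536, device="cuda"):
        # Fitting-stage sketch: `video` is a (T, H, W, 3) float tensor in [0, 1].
        # Random (x, y, t) samples are drawn each step and the field is trained
        # to reproduce the original frames, absorbing their temporal priors.
        T, H, W, _ = video.shape
        field, video = field.to(device), video.to(device)
        opt = torch.optim.Adam(field.parameters(), lr=lr)
        for _ in range(iters):
            ti = torch.randint(0, T, (batch,), device=device)
            yi = torch.randint(0, H, (batch,), device=device)
            xi = torch.randint(0, W, (batch,), device=device)
            coords = torch.stack([2 * xi / (W - 1) - 1,   # normalize to [-1, 1]
                                  2 * yi / (H - 1) - 1,
                                  2 * ti / (T - 1) - 1], dim=-1).float()
            loss = F.mse_loss(field(coords), video[ti, yi, xi])
            opt.zero_grad(); loss.backward(); opt.step()
        return field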

Right, Field Editing Stage: the trained neural video field renders a frame, which is then edited by a pre-trained T2I model (e.g., InstructPix2Pix+). The edited frame is used to optimize the trained field, imparting the editing effects.
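
A minimal sketch of this optimization loop is given below, using the public InstructPix2Pix checkpoint from diffusers as the frozen T2I editor. The naive per-frame loop, MSE distillation loss, and hyperparameters are simplifying assumptions; the paper's progressive optimization strategy and its enhanced InstructPix2Pix+ editor are omitted here.

    import torch
    import torch.nn.functional as F
    from diffusers import StableDiffusionInstructPix2PixPipeline
    from torchvision.transforms.functional import to_pil_image, to_tensor

    def edit_field(field, prompt, T, H, W, epochs=3, lr=1e-2, device="cuda"):
        # Editing-stage sketch: render each frame from the fitted field, edit
        # it with a frozen T2I model, and pull the field toward the edit.
        pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
            "timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to(device)
        opt = torch.optim.Adam(field.parameters(), lr=lr)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        for _ in range(epochs):
            for ti in range(T):
                t = torch.full_like(xs, 2 * ti / max(T - 1, 1) - 1)
                coords = torch.stack([xs, ys, t], dim=-1).view(-1, 3).to(device)
                frame = field(coords).view(H, W, 3)        # differentiable render
                with torch.no_grad():                      # frozen T2I edit
                    pil = to_pil_image(frame.clamp(0, 1).permute(2, 0, 1).cpu())
                    edited = pipe(prompt, image=pil).images[0].resize((W, H))
                target = to_tensor(edited).permute(1, 2, 0).to(device)
                loss = F.mse_loss(frame, target)           # distill the edit
                opt.zero_grad(); loss.backward(); opt.step()
        return field

Because the field is optimized rather than each frame independently, the temporal priors absorbed during fitting regularize the edits across frames.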

Long Video Editing Results

Give him a mustache

Turn him into Albert Einstein

As a bronze bust

Make it snowed

Turn his shirt pink

Make it night

Make it autumn

Remove the ship

Make it sunset

Turn it into a Monet style painting

Turn him into Martin Luther King

Turn it into Cartoon style

Turn the wolf into a brown bear

Turn the wolf into a fox

Turn the wolf into a white fox

More Results (Short Videos)

Turn him into Obama

Turn the dog into a tiger

Turn it into Van-Gogh style

Turn him into the Hulk

Turn it into a ginger cat

Turn it into a Monet style painting

BibTeX

@article{yang2023nvedit,
  title={Neural Video Fields Editing},
  author={Shuzhou Yang and Chong Mou and Jiwen Yu and Yuhan Wang and Xiandong Meng and Jian Zhang},
  journal={arXiv preprint arXiv:2312.08882},
  year={2023}
}