DAPE: Dual‑Stage Parameter‑Efficient Fine‑Tuning for Consistent Video Editing with Diffusion Models

Junhao Xia1,*, Chaoyang Zhang1, Yecheng Zhang1, Chengyang Zhou2, Zhichang Wang3, Bochun Liu1, Dongshuo Yin1,†

1Tsinghua University, Beijing, China   2Duke University, NC, USA   3Peking University, Shenzhen, China

†Corresponding Author

Abstract

Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches fall primarily into two categories: training‑based and training‑free methods. While training‑based methods incur high computational costs, training‑free alternatives often yield suboptimal performance. To address these limitations, we propose DAPE, a high‑quality yet cost‑effective two‑stage parameter‑efficient fine‑tuning (PEFT) framework for video editing. In the first stage, we design an efficient norm‑tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision‑friendly adapter to improve visual quality. Additionally, we identify critical shortcomings in existing benchmarks, including limited category diversity, imbalanced object distribution, and inconsistent frame counts. To mitigate these issues, we curate a large‑scale benchmark comprising 232 videos with rich annotations and six editing prompts, enabling objective and comprehensive evaluation of advanced methods. Extensive experiments on existing datasets (BalanceCC, LOVEU‑TGVE, RAVE) and our proposed benchmark demonstrate that DAPE significantly improves temporal coherence and text‑video alignment while outperforming previous state‑of‑the‑art approaches.
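To make the two‑stage PEFT idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: stage 1 unfreezes only normalization‑layer parameters (norm‑tuning), and stage 2 injects small residual bottleneck adapters while keeping the backbone frozen. The names `BottleneckAdapter` and `AdaptedBlock`, the adapter's bottleneck design, and the toy MLP backbone are illustrative assumptions; the actual DAPE architecture and adapter placement are described in the paper.

```python
# Illustrative sketch of a two-stage PEFT setup (assumptions, not DAPE's exact code).
import torch
import torch.nn as nn

NORM_TYPES = (nn.LayerNorm, nn.GroupNorm, nn.BatchNorm2d)


def stage1_norm_tuning(model: nn.Module) -> None:
    """Stage 1: freeze all weights, then unfreeze only normalization-layer parameters."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, NORM_TYPES):
            for p in m.parameters():
                p.requires_grad = True


class BottleneckAdapter(nn.Module):
    """Small residual bottleneck adapter (hypothetical design), initialized as identity."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts as a no-op
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Stage 2: wrap a frozen backbone block so a trainable adapter refines its output."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


if __name__ == "__main__":
    # Toy backbone standing in for a diffusion UNet/transformer stack.
    dim = 128
    blocks = [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
              for _ in range(4)]

    # Stage 1: norm-tuning on the original backbone.
    backbone = nn.Sequential(*blocks)
    stage1_norm_tuning(backbone)

    # Stage 2: freeze the backbone and train only the injected adapters.
    adapted = nn.Sequential(*[AdaptedBlock(b, dim) for b in blocks])
    for p in adapted.parameters():
        p.requires_grad = False
    for m in adapted.modules():
        if isinstance(m, BottleneckAdapter):
            for p in m.parameters():
                p.requires_grad = True

    x = torch.randn(2, dim)
    print(adapted(x).shape)  # torch.Size([2, 128])
```

In this sketch only a small fraction of parameters is ever trainable (norm scales/shifts in stage 1, adapter weights in stage 2), which is what makes the approach parameter‑efficient; the zero‑initialized up‑projection keeps the adapter from disturbing the pretrained model at the start of stage 2.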

BibTeX

@article{xia2025dape,
  title={DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models},
  author={Xia, Junhao and Zhang, Chaoyang and Zhou, Chengyang and Wang, Zhichang and Liu, Bochun and Yin, Dongshuo},
  journal={arXiv preprint arXiv:2405.00000},
  year={2025}
}