Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

¹The Chinese University of Hong Kong  ²International Digital Economy Academy
A gold plated bowl filled with fruit → A gold plated bowl filled with candy

We propose Direct Inversion, a simple yet potent inversion solution for diffusion-based editing. The essence of Direct Inversion lies in two primary strategies: (1) disentangle the source and target branches, and (2) empower each branch to excel in its designated role: preservation or editing. Teaser comparison, left: Null-Text Inversion + Prompt2Prompt; right: Direct Inversion + Prompt2Prompt.

Abstract

Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt.

Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of speed-up.
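The inversion step described in the abstract, which maps a source image to a noisy latent, is typically implemented with DDIM inversion. The following is a minimal sketch of that step, assuming a generic noise-prediction network eps_model(z, t, cond) and a precomputed alphas_cumprod schedule; these names are illustrative and not taken from the released code.

import torch

@torch.no_grad()
def ddim_inversion(z0, eps_model, cond, alphas_cumprod, timesteps):
    """Run DDIM inversion from a clean source latent z0 toward a noisy latent zT*.

    z0             : latent of the source image (e.g., from a VAE encoder)
    eps_model      : noise-prediction network, eps_model(z, t, cond) -> predicted noise
    cond           : conditioning (e.g., the source-prompt embedding)
    alphas_cumprod : cumulative alpha schedule, indexed by timestep
    timesteps      : increasing list of timesteps, e.g., [1, 21, ..., 981]
    Returns the list of intermediate latents [z0, z1*, ..., zT*],
    which Direct Inversion later reuses in the source branch.
    """
    latents = [z0]
    z = z0
    for i in range(len(timesteps) - 1):
        t_cur, t_next = timesteps[i], timesteps[i + 1]
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(z, t_cur, cond)
        # Predict the clean latent, then re-noise it toward the next (noisier) timestep.
        z0_pred = (z - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
        latents.append(z)
    return latents

Because eps is re-predicted at every step rather than known exactly, the resulting zT* only approximates the ideal noisy latent; this approximation error is the "perturbation" discussed in the comparison below.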

Compare Direct Inversion with Previous Works

Comparisons among different inversion methods in diffusion-based editing. We assume a 2-step diffusion process for illustration. Because the ideal z2 does not exist, common practice uses DDIM Inversion to approximate it, resulting in a perturbed z2*. Diffusion-based editing methods start from this perturbed noisy latent z2* and perform DDIM sampling in a source and a target diffusion branch, which further results in the distances shown in the figure. Null-Text Inversion and StyleDiffusion optimize a specific latent used in both the source and target branches to reduce this distance. Negative-Prompt Inversion sets the guidance scale to 1 to decrease the distance. In contrast, Direct Inversion disentangles the source and target branches in editing. By leaving the target diffusion branch untouched, Direct Inversion retains edit fidelity. By directly returning the source branch to z0, Direct Inversion achieves the best possible essential content preservation.
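A minimal sketch of this disentanglement is given below, reusing the latents stored by the inversion sketch above. The function names (denoise_step) and the exact bookkeeping are illustrative assumptions, not the authors' released implementation; in practice the two branches also exchange attention features through the chosen editing method (e.g., Prompt-to-Prompt).

import torch

@torch.no_grad()
def edit_with_direct_inversion(inversion_latents, denoise_step, src_cond, tgt_cond, timesteps):
    """Two-branch editing loop with Direct Inversion (illustrative only).

    inversion_latents : latents from DDIM inversion, ordered z0 ... zT*
                        (as returned by the inversion sketch above)
    denoise_step      : one DDIM sampling step, denoise_step(z, t, cond) -> latent at the
                        next, less noisy timestep
    timesteps         : sampling timesteps in decreasing order, matching the inversion
    """
    path = list(reversed(inversion_latents))   # path[i] is the stored latent at timesteps[i]
    z_tgt = path[0].clone()                    # target branch starts from the perturbed zT*
    for i, t in enumerate(timesteps):
        # The "three lines": before each step, snap the source branch back to the
        # stored inversion latent, so it exactly retraces the inversion path and
        # preserves the essential content of the source image.
        z_src = path[i]

        z_src = denoise_step(z_src, t, src_cond)   # preservation (source) branch
        z_tgt = denoise_step(z_tgt, t, tgt_cond)   # editing (target) branch, left untouched
    return z_tgt

No per-image optimization is involved, which is why this approach is both simpler and substantially faster than optimization-based inversion such as Null-Text Inversion or StyleDiffusion.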

PIE-Bench

PIE-Bench comprises 700 images in natural and artificial scenes (e.g., paintings) featuring ten distinct editing types as shown in the figure. Each image in PIE-Bench includes five annotations: a source image prompt, a target image prompt, an editing instruction, edit subjects describing the main editing body, and the editing mask.
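As an illustration only, a single benchmark entry might be represented roughly as follows; the field names, values, and file layout are hypothetical and may differ from the released PIE-Bench annotation files.

# Hypothetical structure of one PIE-Bench annotation entry (illustrative only).
example_annotation = {
    "image_path": "images/0001.jpg",           # path to the source image (made up here)
    "source_prompt": "a gold plated bowl filled with fruit",
    "target_prompt": "a gold plated bowl filled with candy",
    "editing_instruction": "change the fruit to candy",
    "edit_subject": "fruit",                   # the main editing body
    "mask_path": "masks/0001.png",             # binary mask of the region to be edited
    "editing_type": "change object",           # one of the ten editing types
}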

Quantitative Results

Comparison of Direct Inversion with other inversion techniques across various editing methods. For the editing method Prompt-to-Prompt (P2P), we compare against four inversion methods: DDIM Inversion (DDIM), Null-Text Inversion (NT), Negative-Prompt Inversion (NP), and StyleDiffusion (StyleD). For the editing methods MasaCtrl, Pix2Pix-Zero (P2P-Zero), and Plug-and-Play (PnP), we compare with DDIM Inversion (DDIM), following their original settings.

Qualitative Results

Plug-and-Play Feature

Performance enhancement from incorporating Direct Inversion into four diffusion-based editing methods across various editing categories (from top to bottom): style transfer, object replacement, and color change. The editing prompt is displayed at the top of each row. Each row includes (a) the source image and the editing results of (b) Prompt-to-Prompt (P2P), (c) MasaCtrl, (d) pix2pix-zero, and (e) plug-and-play. Each set of results is presented in two columns: the first w/o Direct Inversion (Null-Text Inversion for P2P, DDIM Inversion for the others) and the second w/ Direct Inversion. Incorporating Direct Inversion into diffusion-based editing methods improves image structure preservation (better structure distance) for full-image editing and background preservation (higher PSNR in the background, i.e., areas that should remain unedited) for foreground editing. The improvements are mostly tangible, and we circle some of the subtle discrepancies w/o Direct Inversion in red.
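For context, background preservation can be scored by computing PSNR only over the region outside the editing mask. The snippet below is a minimal sketch of such a masked PSNR under that assumption, not the exact evaluation script used for the benchmark.

import numpy as np

def background_psnr(src, edited, edit_mask, max_val=255.0):
    """PSNR restricted to the background (pixels outside the editing mask).

    src, edited : HxWxC arrays of the source and edited images
    edit_mask   : HxW boolean array, True where editing is allowed
    """
    background = ~edit_mask
    diff = src.astype(np.float64)[background] - edited.astype(np.float64)[background]
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical background pixels
    return 10.0 * np.log10(max_val ** 2 / mse)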


Compare with Different Inversion Techniques

Visualization results of different inversion and editing techniques. The source image is shown in column (a). We compare (h) Direct Inversion with different inversion techniques combined with Prompt-to-Prompt: (b) DDIM Inversion, (c) Null-Text Inversion, (d) Negative-Prompt Inversion, and (e) StyleDiffusion. We also compare with model-based editing results: (f) Instruct-Pix2Pix and (g) Blended Latent Diffusion. The improvements are mostly tangible, and we circle some of the subtle discrepancies w/o Direct Inversion in red.

BibTeX

@article{ju2023direct,
  author    = {Ju, Xuan and Zeng, Ailing and Bian, Yuxuan and Liu, Shaoteng and Xu, Qiang},
  title     = {Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code},
  journal   = {arXiv preprint arXiv:2304.04269},
  year      = {2023},
}