A Latent Space of Stochastic Diffusion Models
for Zero-Shot Image Editing and Guidance

Chen Henry Wu         Fernando de la Torre
Robotics Institute, Carnegie Mellon University

ICCV 2023

Abstract

Diffusion models generate images by iterative denoising. Recent work has shown that by making the denoising process deterministic, one can encode real images into latent codes of the same size, which can be used for image editing. This paper explores the possibility of defining a latent space even when the denoising process remains stochastic. Recall that, in stochastic diffusion models, Gaussian noise is added at each denoising step, and we can concatenate all of these noise samples to form a latent code. This results in a latent space of much higher dimensionality than the original image. We demonstrate that this latent space of stochastic diffusion models can be used in the same way as that of deterministic diffusion models in two applications. First, we propose CycleDiffusion, a method for zero-shot and unpaired image editing using stochastic diffusion models, which improves the performance over its deterministic counterpart. Second, we demonstrate unified, plug-and-play guidance in the latent spaces of deterministic and stochastic diffusion models.

We define a new latent space for stochastic diffusion probabilistic models. This latent space allows us to perform: (1) Unpaired image-to-image translation with two independently pre-trained diffusion models (e.g., cat and dog). In this case, we can transfer the texture characteristics from an image of a cat to the model of a dog in an unsupervised fashion. (2) Zero-shot image editing with a pre-trained text-to-image diffusion model. In this case, we can edit images with text prompts. (3) Plug-and-play guidance of a pre-trained diffusion model with off-the-shelf image understanding models such as CLIP. In this case, we are able to sub-sample a generative model of faces guided by attributes like "eyeglasses" or "old".

Method

We concatenate all Gaussian noises in the denoising process to form the latent code of a stochastic diffusion model. This definition is straightforward, but we show that it allows us to perform zero-shot image editing and guidance for stochastic diffusion models.
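As a concrete illustration, the sketch below encodes a real image into this latent code for a standard DDPM. It assumes a noise-prediction network eps_model(x_t, t) and a 1-D tensor betas holding the noise schedule; these names, and the helper dpm_encode, are illustrative rather than the released implementation.

    import torch

    @torch.no_grad()
    def dpm_encode(x0, eps_model, betas):
        """Encode a real image x0 into the latent code z = (x_T, eps_T, ..., eps_1),
        so that replaying these noises in the stochastic denoising process recovers x0."""
        T = len(betas)
        betas = torch.cat([torch.zeros(1, device=betas.device), betas])  # betas[t] = beta_t, t = 1..T
        alphas = 1.0 - betas
        abar = torch.cumprod(alphas, dim=0)          # abar[t] = \bar{alpha}_t, with abar[0] = 1

        # Sample x_T from q(x_T | x_0); it is the first part of the latent code.
        x_t = torch.sqrt(abar[T]) * x0 + torch.sqrt(1 - abar[T]) * torch.randn_like(x0)
        latent = [x_t.flatten()]

        for t in range(T, 0, -1):
            # Sample x_{t-1} from the diffusion posterior q(x_{t-1} | x_t, x_0).
            post_mean = (torch.sqrt(abar[t - 1]) * betas[t] * x0
                         + torch.sqrt(alphas[t]) * (1 - abar[t - 1]) * x_t) / (1 - abar[t])
            post_std = torch.sqrt((1 - abar[t - 1]) / (1 - abar[t]) * betas[t])
            x_prev = post_mean + post_std * torch.randn_like(x0)

            # Solve for the Gaussian noise eps_t that makes the model's stochastic
            # reverse step (with sigma_t = sqrt(beta_t)) land exactly on x_{t-1}.
            model_mean = ((x_t - betas[t] / torch.sqrt(1 - abar[t]) * eps_model(x_t, t))
                          / torch.sqrt(alphas[t]))
            latent.append(((x_prev - model_mean) / torch.sqrt(betas[t])).flatten())
            x_t = x_prev

        return torch.cat(latent)                     # dimensionality: (T + 1) x image size

Decoding is the mirror image: starting from the stored x_T, each reverse step computes x_{t-1} = mu_theta(x_t, t) + sigma_t * eps_t with the stored eps_t, so replaying the latent code with the same model reproduces the input, while replaying it with a different model or prompt produces an edited image.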


Based on the above definition of latent space, we propose CycleDiffusion, a method for zero-shot and unpaired image editing using stochastic diffusion models.

CycleDiffusion for Zero-Shot Image Editing

CycleDiffusion performs zero-shot image editing with a pre-trained text-to-image diffusion model (e.g., Stable Diffusion). The edit is specified by a source prompt that describes the input image and a target prompt that describes the desired output.
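At the time of writing, the diffusers library ships a CycleDiffusionPipeline; the snippet below is a hedged sketch of how zero-shot editing with Stable Diffusion might look with it. Argument names and defaults may differ across diffusers versions, and the file names and prompts are placeholders.

    import torch
    from PIL import Image
    from diffusers import CycleDiffusionPipeline, DDIMScheduler

    model_id = "CompVis/stable-diffusion-v1-4"
    scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
    pipe = CycleDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler).to("cuda")

    init_image = Image.open("input.png").convert("RGB").resize((512, 512))
    edited = pipe(
        prompt="An astronaut riding an elephant",     # target prompt
        source_prompt="An astronaut riding a horse",  # source prompt (describes the input)
        image=init_image,
        num_inference_steps=100,
        strength=0.8,        # how much of the denoising trajectory is re-run
        eta=0.1,             # amount of stochasticity in the sampler
        guidance_scale=2.0,
        source_guidance_scale=1.0,
    ).images[0]
    edited.save("edited.png")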


CycleDiffusion is also compatible with attention manipulation techniques such as Cross Attention Control (CAC). With CAC, CycleDiffusion can better preserve the structure of the input image during editing.

CycleDiffusion for Unpaired Image Editing

CycleDiffusion is also capable of unpaired image editing with two independently pre-trained diffusion models (e.g., a dog diffusion model and a cat diffusion model).
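A minimal sketch of this use case reuses dpm_encode from above together with a hypothetical dpm_decode that replays the stored noises: encode the input with the source-domain model, then decode the same latent code with the target-domain model.

    import math
    import torch

    @torch.no_grad()
    def dpm_decode(z, eps_model, betas, image_shape):
        """Replay the latent code z = (x_T, eps_T, ..., eps_1) through a (possibly
        different) diffusion model; mirrors dpm_encode above."""
        T = len(betas)
        betas = torch.cat([torch.zeros(1, device=betas.device), betas])
        alphas = 1.0 - betas
        abar = torch.cumprod(alphas, dim=0)

        chunks = z.split(math.prod(image_shape))     # (x_T, eps_T, ..., eps_1)
        x_t = chunks[0].reshape(image_shape)
        for i, t in enumerate(range(T, 0, -1), start=1):
            eps_t = chunks[i].reshape(image_shape)
            model_mean = ((x_t - betas[t] / torch.sqrt(1 - abar[t]) * eps_model(x_t, t))
                          / torch.sqrt(alphas[t]))
            x_t = model_mean + torch.sqrt(betas[t]) * eps_t  # same sigma_t as in encoding
        return x_t

    # Unpaired cat -> dog editing with two independently trained models:
    # z = dpm_encode(cat_image, cat_eps_model, betas)
    # dog_image = dpm_decode(z, dog_eps_model, betas, cat_image.shape)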

Guiding stochastic diffusion models like StyleGANs

Defining a latent space for stochastic diffusion models allows us to guide them in the same way we guide StyleGANs. For instance, we can guide a pre-trained diffusion model with off-the-shelf image understanding models such as CLIP. In this case, we are able to sub-sample a generative model of faces guided by attributes like "eyeglasses".
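The sketch below illustrates such guidance in the spirit of CLIP-guided StyleGAN latent optimization. It assumes a differentiable decode(z) that runs the stochastic denoising process on the concatenated noises (e.g., a differentiable variant of dpm_decode above, without torch.no_grad) and OpenAI's CLIP package; the regularization weight and optimizer settings are illustrative.

    import torch
    import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

    def clip_guided_sample(decode, prompt, latent_dim, steps=200, lr=0.05, device="cuda"):
        """Optimize the concatenated-noise latent code so that the decoded image
        matches a text prompt."""
        model, _ = clip.load("ViT-B/32", device=device)
        model = model.float()
        with torch.no_grad():
            text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
            text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # CLIP input normalization constants.
        mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
        std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

        z = torch.randn(latent_dim, device=device, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            # Decoding with gradients through all denoising steps is memory-heavy;
            # in practice use fewer steps or gradient checkpointing.
            image = decode(z)                                 # assumed in [0, 1], (1, 3, H, W)
            image = torch.nn.functional.interpolate(image, size=224, mode="bilinear")
            img_feat = model.encode_image((image - mean) / std)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            clip_loss = 1.0 - (img_feat * text_feat).sum()    # maximize cosine similarity
            prior_reg = 1e-4 * z.pow(2).sum()                 # keep z close to the Gaussian prior
            opt.zero_grad()
            (clip_loss + prior_reg).backward()
            opt.step()
        return decode(z).detach()

    # Example: faces with eyeglasses from an unconditional face model.
    # image = clip_guided_sample(decode, "a face with eyeglasses", latent_dim=z_dim)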

BibTeX


    @inproceedings{cyclediffusion,
      title={A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance},
      author={Chen Henry Wu and Fernando De la Torre},
      booktitle={ICCV},
      year={2023},
    }