A Latent Space of Stochastic Diffusion Models
for Zero-Shot Image Editing and Guidance

Chen Henry Wu         Fernando de la Torre
Robotics Institute, Carnegie Mellon University

ICCV 2023

Abstract

Diffusion models generate images by iterative denoising. Recent work has shown that by making the denoising process deterministic, one can encode real images into latent codes of the same size, which can be used for image editing. This paper explores the possibility of defining a latent space even when the denoising process remains stochastic. Recall that, in stochastic diffusion models, Gaussian noise is added at each denoising step, and we can concatenate all of these noise samples to form a latent code. This results in a latent space of much higher dimensionality than the original image. We demonstrate that this latent space of stochastic diffusion models can be used in the same way as that of deterministic diffusion models in two applications. First, we propose CycleDiffusion, a method for zero-shot and unpaired image editing using stochastic diffusion models, which improves the performance over its deterministic counterpart. Second, we demonstrate unified, plug-and-play guidance in the latent spaces of deterministic and stochastic diffusion models.

We define a new latent space for stochastic diffusion probabilistic models. This latent space allows us to perform: (1) Unpaired image-to-image translation with two independently pre-trained diffusion models (e.g., cat and dog). In this case, we can transfer the texture characteristics from an image of a cat to the model of a dog in an unsupervised fashion. (2) Zero-shot image editing with a pre-trained text-to-image diffusion model. In this case, we can edit images with text prompts. (3) Plug-and-play guidance of a pre-trained diffusion model with off-the-shelf image understanding models such as CLIP. In this case, we are able to sub-sample a generative model of faces guided by attributes like "eyeglasses" or "old".

Method

We concatenate all Gaussian noises in the denoising process to form the latent code of a stochastic diffusion model. This definition is straightforward, but we show that it allows us to perform zero-shot image editing and guidance for stochastic diffusion models.
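As a concrete illustration, the sketch below encodes a real image into this latent code for a standard DDPM. It assumes a noise-prediction network eps_model(x_t, t) and a 1-D tensor betas holding the noise schedule; these names, and the helper dpm_encode, are illustrative rather than the released implementation.

    import torch

    @torch.no_grad()
    def dpm_encode(x0, eps_model, betas):
        """Encode a real image x0 into the latent code z = (x_T, eps_T, ..., eps_1),
        so that replaying these noises in the stochastic denoising process recovers x0."""
        T = len(betas)
        betas = torch.cat([torch.zeros(1, device=betas.device), betas])  # betas[t] = beta_t, t = 1..T
        alphas = 1.0 - betas
        abar = torch.cumprod(alphas, dim=0)          # abar[t] = \bar{alpha}_t, with abar[0] = 1

        # Sample x_T from q(x_T | x_0); it is the first part of the latent code.
        x_t = torch.sqrt(abar[T]) * x0 + torch.sqrt(1 - abar[T]) * torch.randn_like(x0)
        latent = [x_t.flatten()]

        for t in range(T, 0, -1):
            # Sample x_{t-1} from the diffusion posterior q(x_{t-1} | x_t, x_0).
            post_mean = (torch.sqrt(abar[t - 1]) * betas[t] * x0
                         + torch.sqrt(alphas[t]) * (1 - abar[t - 1]) * x_t) / (1 - abar[t])
            post_std = torch.sqrt((1 - abar[t - 1]) / (1 - abar[t]) * betas[t])
            x_prev = post_mean + post_std * torch.randn_like(x0)

            # Solve for the Gaussian noise eps_t that makes the model's stochastic
            # reverse step (with sigma_t = sqrt(beta_t)) land exactly on x_{t-1}.
            model_mean = ((x_t - betas[t] / torch.sqrt(1 - abar[t]) * eps_model(x_t, t))
                          / torch.sqrt(alphas[t]))
            latent.append(((x_prev - model_mean) / torch.sqrt(betas[t])).flatten())
            x_t = x_prev

        return torch.cat(latent)                     # dimensionality: (T + 1) x image size

Decoding is the mirror image: starting from the stored x_T, each reverse step computes x_{t-1} = mu_theta(x_t, t) + sigma_t * eps_t with the stored eps_t, so replaying the latent code with the same model reproduces the input, while replaying it with a different model or prompt produces an edited image.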


Based on the above definition of latent space, we propose CycleDiffusion, a method for zero-shot and unpaired image editing using stochastic diffusion models.

CycleDiffusion for Zero-Shot Image Editing

CycleDiffusion performs zero-shot image editing with a pre-trained text-to-image diffusion model (e.g., Stable Diffusion). The edit is specified by a source prompt that describes the input image and a target prompt that describes the desired output.
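At the time of writing, the diffusers library ships a CycleDiffusionPipeline; the snippet below is a hedged sketch of how zero-shot editing with Stable Diffusion might look with it. Argument names and defaults may differ across diffusers versions, and the file names and prompts are placeholders.

    import torch
    from PIL import Image
    from diffusers import CycleDiffusionPipeline, DDIMScheduler

    model_id = "CompVis/stable-diffusion-v1-4"
    scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
    pipe = CycleDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler).to("cuda")

    init_image = Image.open("input.png").convert("RGB").resize((512, 512))
    edited = pipe(
        prompt="An astronaut riding an elephant",     # target prompt
        source_prompt="An astronaut riding a horse",  # source prompt (describes the input)
        image=init_image,
        num_inference_steps=100,
        strength=0.8,        # how much of the denoising trajectory is re-run
        eta=0.1,             # amount of stochasticity in the sampler
        guidance_scale=2.0,
        source_guidance_scale=1.0,
    ).images[0]
    edited.save("edited.png")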


CycleDiffusion is also compatible with attention manipulation techniques such as Cross Attention Control (CAC). With CAC, CycleDiffusion can better preserve the structure of the input image during editing.

CycleDiffusion for Unpaired Image Editing

CycleDiffusion is also capable of unpaired image editing with two independently pre-trained diffusion models (e.g., a dog diffusion model and a cat diffusion model).
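A minimal sketch of this use case reuses dpm_encode from above together with a hypothetical dpm_decode that replays the stored noises: encode the input with the source-domain model, then decode the same latent code with the target-domain model.

    import math
    import torch

    @torch.no_grad()
    def dpm_decode(z, eps_model, betas, image_shape):
        """Replay the latent code z = (x_T, eps_T, ..., eps_1) through a (possibly
        different) diffusion model; mirrors dpm_encode above."""
        T = len(betas)
        betas = torch.cat([torch.zeros(1, device=betas.device), betas])
        alphas = 1.0 - betas
        abar = torch.cumprod(alphas, dim=0)

        chunks = z.split(math.prod(image_shape))     # (x_T, eps_T, ..., eps_1)
        x_t = chunks[0].reshape(image_shape)
        for i, t in enumerate(range(T, 0, -1), start=1):
            eps_t = chunks[i].reshape(image_shape)
            model_mean = ((x_t - betas[t] / torch.sqrt(1 - abar[t]) * eps_model(x_t, t))
                          / torch.sqrt(alphas[t]))
            x_t = model_mean + torch.sqrt(betas[t]) * eps_t  # same sigma_t as in encoding
        return x_t

    # Unpaired cat -> dog editing with two independently trained models:
    # z = dpm_encode(cat_image, cat_eps_model, betas)
    # dog_image = dpm_decode(z, dog_eps_model, betas, cat_image.shape)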

Guiding stochastic diffusion models like StyleGANs

Defining a latent space for stochastic diffusion models allows us to guide them in the same way we guide StyleGANs. For instance, we can guide a pre-trained diffusion model with off-the-shelf image understanding models such as CLIP. In this case, we are able to sub-sample a generative model of faces guided by attributes like "eyeglasses".
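The sketch below illustrates such guidance in the spirit of CLIP-guided StyleGAN latent optimization. It assumes a differentiable decode(z) that runs the stochastic denoising process on the concatenated noises (e.g., a differentiable variant of dpm_decode above, without torch.no_grad) and OpenAI's CLIP package; the regularization weight and optimizer settings are illustrative.

    import torch
    import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

    def clip_guided_sample(decode, prompt, latent_dim, steps=200, lr=0.05, device="cuda"):
        """Optimize the concatenated-noise latent code so that the decoded image
        matches a text prompt."""
        model, _ = clip.load("ViT-B/32", device=device)
        model = model.float()
        with torch.no_grad():
            text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
            text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        # CLIP input normalization constants.
        mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
        std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

        z = torch.randn(latent_dim, device=device, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            # Decoding with gradients through all denoising steps is memory-heavy;
            # in practice use fewer steps or gradient checkpointing.
            image = decode(z)                                 # assumed in [0, 1], (1, 3, H, W)
            image = torch.nn.functional.interpolate(image, size=224, mode="bilinear")
            img_feat = model.encode_image((image - mean) / std)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            clip_loss = 1.0 - (img_feat * text_feat).sum()    # maximize cosine similarity
            prior_reg = 1e-4 * z.pow(2).sum()                 # keep z close to the Gaussian prior
            opt.zero_grad()
            (clip_loss + prior_reg).backward()
            opt.step()
        return decode(z).detach()

    # Example: faces with eyeglasses from an unconditional face model.
    # image = clip_guided_sample(decode, "a face with eyeglasses", latent_dim=z_dim)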

BibTeX


    @inproceedings{cyclediffusion,
      title={A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance},
      author={Chen Henry Wu and Fernando De la Torre},
      booktitle={ICCV},
      year={2023},
    }