Beyond Surface Statistics:
Scene Representations in a Latent Diffusion Model
Yida Chen
Fernanda Viegas
Martin Wattenberg
[Paper]
[GitHub]
Linear probes found strong representations of foreground and scene depth in a pretrained 2D Stable Diffusion model. These scene representations (2nd row & 4th row) emerge early in the denoising process, even though the input latents are still noisy and the decoded images (1st row) are not yet understandable to humans.

Abstract

 Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process—well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output.


Linear internal representations of
foreground / background and depth

Does a 2D image-generating diffusion model learn the geometry of the scenes it creates? Can it "see" beyond the 2D matrix of pixels and distinguish the depth of objects in its synthesized scenes? The answer to these questions appears to be "Yes", given the evidence we found using linear probing.

Linear probes find dimensions of foreground and scene depth in the activation space of a latent diffusion model (LDM). A linear probe was trained to predict foreground and scene depth from the network's intermediate representations of the input, i.e., its intermediate activations.

In the figure above, the left side of each column shows images sampled from the LDM. The right side shows the predictions of linear probes that take the LDM's self-attention layer activations as input.
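As a rough illustration of how such activations can be collected, the sketch below registers PyTorch forward hooks on the U-Net's self-attention blocks. It assumes the Hugging Face diffusers library, where self-attention modules are named "attn1"; the checkpoint, layers, and inputs are placeholders rather than our exact experimental setup.

```python
# Minimal sketch: capturing self-attention activations from a Stable Diffusion
# U-Net with forward hooks. Module names follow the diffusers convention
# ("attn1" = self-attention); adapt them to the checkpoint you actually use.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Store a detached copy of this layer's output for later probing.
        activations[name] = output.detach().cpu()
    return hook

handles = []
for name, module in unet.named_modules():
    if name.endswith("attn1"):  # self-attention blocks
        handles.append(module.register_forward_hook(make_hook(name)))

# One denoising call fills `activations`; the latents, timestep, and prompt
# embedding below are random placeholders for what a real sampling loop provides.
latents = torch.randn(1, 4, 64, 64)
timestep = torch.tensor([500])
text_emb = torch.randn(1, 77, 768)
with torch.no_grad():
    unet(latents, timestep, encoder_hidden_states=text_emb)

for h in handles:
    h.remove()
```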

Optimization on Linear Probes: Finding Dimensions of Features

A linear probing classifier can be seen as a matrix of trainable weights. We apply this matrix to project the intermediate activations of the diffusion model, and the projected values are used as logits to predict properties of the output image.

Optimizing the linear probe amounts to searching for a linear axis in activation space that best aligns with the target property, foreground or scene depth in our case.

Linear probing is a common technique for interpreting the learned internal representations of a neural network.
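The sketch below makes this concrete: the probe is a single nn.Linear layer trained with a standard classification loss on cached activations. The activation dimension, label format, and training details are illustrative stand-ins, not the exact configuration used in the paper.

```python
# Minimal sketch of a linear probe: one weight matrix mapping each spatial
# activation vector to foreground/background logits.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, d_act: int, n_classes: int = 2):
        super().__init__()
        # A single trainable matrix (plus bias): a purely linear map.
        self.proj = nn.Linear(d_act, n_classes)

    def forward(self, acts):              # acts: (batch, h*w, d_act)
        return self.proj(acts)            # logits: (batch, h*w, n_classes)

# Toy data standing in for cached activations and per-pixel foreground labels.
acts = torch.randn(64, 16 * 16, 1280)
labels = torch.randint(0, 2, (64, 16 * 16))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(acts, labels), batch_size=8
)

probe = LinearProbe(d_act=1280)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for batch_acts, batch_labels in loader:
    logits = probe(batch_acts)
    loss = loss_fn(logits.reshape(-1, 2), batch_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```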


Text-to-Image Generation with Intervention

The high accuracy of the linear probes indicates high mutual information between the model's internal representations and the scene properties of its generated images. However, whether this dependency implies that the internal representation causally shapes the model's output, or is merely a strong correlation, needs further testing.

How do we test causality? We use a simple setting: we shift the diffusion model's activations along the feature axis identified by the linear probes and observe how this translation of the activation vectors affects the model's output.
We implemented this intervention as a new text-to-image generation method.

In the original image generation pipeline (upper part of the figure above), we train linear probes on the model's intermediate activations to predict the foreground of a generated image.

In the intervened image generation pipeline (lower part of the figure), we modify the intermediate activations using the projection learned by the probe, so that each pixel's foreground/background assignment changes to match a new foreground map d'b. We make no changes to the model's weights, initial latent vector, random seed, or prompt.

If a causal link between the internal representation and scene geometry exists, then changing the internal representation of a scene property should change that property in the generated image accordingly.
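One simple way to realize such an intervention, sketched below, is to take a few gradient steps on the activations themselves so that the frozen probe's predictions match the new foreground map. This is an illustrative approximation of the idea rather than the exact update rule used in our experiments; the step size and number of steps are arbitrary.

```python
# Minimal sketch of the intervention: nudge the spatial activations along the
# probe's foreground direction until the probe's prediction matches a new
# target foreground map. The probe's weights stay frozen throughout.
import torch

def intervene(acts, probe, target_map, lr=0.5, steps=10):
    """acts: (h*w, d_act) activations at one layer and timestep.
    target_map: (h*w,) new foreground labels in {0, 1}."""
    acts = acts.clone().requires_grad_(True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        logits = probe(acts)                        # (h*w, 2) foreground logits
        loss = loss_fn(logits, target_map)
        grad, = torch.autograd.grad(loss, acts)
        # Update the activations only; the probe is untouched.
        acts = (acts - lr * grad).detach().requires_grad_(True)
    return acts.detach()

# Example usage with random stand-ins for real activations and a target map.
probe = torch.nn.Linear(1280, 2)                    # or a trained probe from above
acts = torch.randn(16 * 16, 1280)
new_foreground = torch.randint(0, 2, (16 * 16,))
edited_acts = intervene(acts, probe, new_foreground)
```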


Generating image sequences of moving objects using intervention

As shown in our paper, the intervention has causal effects on the model's output. The scene depth of an image can be rewritten by intervening on the model's intermediate activations. We can even generate a video of a moving motorcycle by continuously translating the foreground representation in the 2D plane.
All intervened outputs are sampled from the same initial latent vector and prompt as the original output.
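The sketch below illustrates the bookkeeping for such a sequence: the target foreground map is shifted a few pixels per frame while the initial latent vector, seed, and prompt stay fixed. The mask shapes and step sizes are illustrative, and the call into the intervened pipeline is left abstract.

```python
# Illustrative sketch: build one target foreground map per frame by shifting
# the mask, while the initial latents (and prompt) are shared across frames.
import torch

def translate_mask(mask, dx):
    """Shift a (h, w) binary foreground mask dx columns to the right."""
    return torch.roll(mask, shifts=dx, dims=1)

base_mask = torch.zeros(64, 64)
base_mask[24:40, 8:24] = 1.0                    # initial foreground region

torch.manual_seed(0)                            # same seed for every frame
latents = torch.randn(1, 4, 64, 64)             # shared initial latent vector

# Each mask would be passed, together with the fixed latents and prompt,
# to the intervened generation pipeline described above.
masks = [translate_mask(base_mask, dx=4 * t) for t in range(8)]
```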


More examples of videos featuring moving objects, sampled from the 2D Stable Diffusion model using our intervention technique.
Prompt = Southern living container plants
Prompt = Elissa Leather Chain Strap Shoulder Bag...




Paper and Supplementary Material

Y. Chen, F. Viegas,
M. Wattenberg
Beyond Surface Statistics:
Scene Representations in a Latent Diffusion Model

(hosted on arXiv)


[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.