Diffusion Models Demystified
The intuition behind the building blocks of Diffusion models
Diffusion models are generative models used across different deep learning domains. Currently, they are mainly used for image and audio generation. Most notably, these models are the driving force behind impressive image generation systems such as DALL·E 2 and Stable Diffusion. I’m sure you’ve seen the scintillating images these models produce. The awe-inspiring results are a testament to how exciting the progress in deep learning is.
What is Diffusion?
In physics, diffusion is simply the overall movement of anything (atoms, energy, molecules) from a region of higher concentration to a region of lower concentration. Imagine dropping a small drop of paint into a glass of water: at first, the paint is concentrated in one spot, but as time passes it diffuses through the water until it reaches equilibrium. Wouldn’t it be great if we could reverse this process? Unfortunately, that’s not physically possible. Diffusion models, however, try to fit a model with the end goal of reversing exactly this kind of process.
The Intuition behind Diffusion Models
We can summarize diffusion models no better than the authors of Deep Unsupervised Learning using Nonequilibrium Thermodynamics did:
The essential idea is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.
Diffusion models try to reproduce the diffusion process by iteratively adding noise to the original image. We keep adding noise until the image becomes pure noise. The noising is defined by a Markov chain of events. A Markov chain is a model of events in which each time step depends only on the previous time step. The Markov property is defined as follows:
P(Xₙ = iₙ | X₀ = i₀, …, Xₙ₋₁ = iₙ₋₁) = P(Xₙ = iₙ | Xₙ₋₁ = iₙ₋₁)
So any sequence of random variables X₀, X₁, X₂, …, Xₙ that satisfies the condition above can be regarded as a Markov chain. This Markovian assumption is what makes learning the added noise tractable. After training a model to predict the noise at each time step, the model will be able to generate high-resolution images from a Gaussian noise input. To summarize: we keep adding noise to an image until we are left with nothing but pure noise, then we train a neural network to remove that noise. So diffusion models consist of two stages (a minimal code sketch of a single noising step follows the list):
- Forward diffusion process
- Reverse diffusion process
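To make the Markov noising step concrete, here is a minimal PyTorch sketch of a single forward step; the tensor shapes and the β value are illustrative assumptions, not values from the papers:

```python
import torch

def forward_step(x_prev: torch.Tensor, beta: float) -> torch.Tensor:
    """One Markov step of the forward process:
    x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise."""
    noise = torch.randn_like(x_prev)                    # z ~ N(0, I)
    return (1 - beta) ** 0.5 * x_prev + beta ** 0.5 * noise

x = torch.randn(8, 3, 32, 32)       # stand-in for a batch of real images
x = forward_step(x, beta=0.02)      # the new state depends only on the previous one
```

Note that the function only ever sees the previous state; that is the Markov property in action.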
Forward Diffusion Process
The forward diffusion process is the stage in which the structure of the data is destroyed. This is done by applying noise sampled from a normal distribution; the image will eventually converge to pure noise z ~ N(0, I). The amount of noise applied at each time step is not constant: a schedule is used to scale the mean and the variance. The original DDPM paper applied a linear schedule, but researchers from OpenAI found that this leads to many redundant diffusion steps, so in their Improved Denoising Diffusion Probabilistic Models paper they implemented their own cosine schedule.
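As a rough sketch, the two schedules could look like this; the linear endpoints (1e-4 to 0.02) follow the DDPM paper, and the cosine formula with its s = 0.008 offset and 0.999 clipping follows the Improved DDPM paper:

```python
import math
import torch

def linear_beta_schedule(T: int) -> torch.Tensor:
    """Linear schedule from the DDPM paper: beta_1 = 1e-4 up to beta_T = 0.02."""
    return torch.linspace(1e-4, 0.02, T)

def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule from the Improved DDPM paper:
    alpha_bar(t) = cos^2(((t/T + s) / (1 + s)) * pi/2), normalized so alpha_bar(0) = 1."""
    t = torch.arange(T + 1)
    alpha_bar = torch.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999)   # clipped as in the paper to avoid singularities near t = T
```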
The forward process is defined as q(xₜ|xₜ₋₁). This function simply adds noise at each time step t. The mathematical definition of the forward process is the following:

q(xₜ|xₜ₋₁) = N(xₜ; √(1−βₜ)·xₜ₋₁, βₜI)

You might remember from your statistics classes that a normal distribution is parametrized by its mean and variance. Here, √(1−βₜ)·xₜ₋₁ is the mean and βₜI is the variance. The betas are simply values between 0 and 1, with 0 < β₁ < β₂ < … < β_T < 1; they aren’t constant and are regulated by the variance schedule. Normally, you would have to repeat this process once for every time step t, but completing it in a single step would save us a lot of compute. Let’s look at how that’s done. First, define αₜ = 1−βₜ. We can then define the cumulative product of all the alphas, ᾱₜ = ∏ₛ₌₁ᵗ αₛ. Now, using the reparameterization trick, we can rewrite the forward step as the following:

xₜ = √(1−βₜ)·xₜ₋₁ + √βₜ·ε, where ε ~ N(0, I)
Using the alphas, we can rewrite it as:

xₜ = √αₜ·xₜ₋₁ + √(1−αₜ)·ε
As you guessed, we can now expand xₜ₋₁ the same way and push this back through the previous time steps:

xₜ = √(αₜαₜ₋₁)·xₜ₋₂ + √(1−αₜαₜ₋₁)·ε
Using the product of all the alphas, the final equation takes the following form:

xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε
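Putting the closed form into code, here is a minimal sketch of noising an image to an arbitrary time step in a single call; the shapes and the linear schedule are illustrative assumptions:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear variance schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # cumulative product of the alphas

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, in one shot."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)      # per-sample alpha_bar, broadcast over C, H, W
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

x0 = torch.randn(8, 3, 32, 32)               # stand-in for a batch of real images
t = torch.randint(0, T, (8,))                # a random time step for each sample
xt = q_sample(x0, t)
```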
Congrats! That was it for the forward diffusion process :)
Reverse Diffusion Process
It would have been quite nice to reverse the above process by calculating q(xₜ₋₁|xₜ). Unfortunately, computing it would require the entire data distribution, which we don’t have access to. We therefore resort to a neural network that learns to approximate these conditional probabilities. In the reverse process, a neural network will predict the mean given the image. The neural network will look at the image and try to determine the distribution of the images where that image came from in the forward process.
Our loss function for diffusion models is simply −log(pθ(x₀)). The problem is that diffusion models are latent variable models, so pθ(x₀) takes the following form:

pθ(x₀) = ∫ pθ(x₀:T) dx₁:T
As you can imagine, this integral has no closed-form solution. The solution is to compute a variational lower bound. Please note that knowing the derivation of VAEs will help you follow the formulas below. The entire reverse process is defined as:

pθ(x₀:T) = p(x_T) ∏ₜ₌₁ᵀ pθ(xₜ₋₁|xₜ)
As this is a joint distribution, we multiply the individual reverse steps together. Remember that pθ(xₜ₋₁|xₜ) takes you from a “noisier” image to a “less noisy” one. I mentioned the variational lower bound, but what is it? At a high level, suppose we have an intractable function f(x). If we can find a function g(x) that is always smaller than f(x), then by maximizing g(x) we push f(x) up as well, since f(x) ≥ g(x). Equivalently, for our loss we can minimize an upper bound on the negative log-likelihood. We obtain one by adding a KL divergence, which is always non-negative, to our original function f(x) = −log(pθ(x₀)):

−log pθ(x₀) ≤ −log pθ(x₀) + D_KL( q(x₁:T|x₀) ‖ pθ(x₁:T|x₀) )
Rewriting pθ(x₁:T|x₀) inside the KL divergence through Bayes’ theorem as pθ(x₀:T)/pθ(x₀), the two log pθ(x₀) terms cancel and we get the following:

−log pθ(x₀) ≤ E_q[ log( q(x₁:T|x₀) / pθ(x₀:T) ) ]
So our variational lower bound becomes:

L_VLB = E_q[ log( q(x₁:T|x₀) / pθ(x₀:T) ) ]
Our goal now is to convert the right-hand side into something analytically computable. Let’s start by writing both distributions inside the log as products over the time steps:

L_VLB = E_q[ log( ∏ₜ₌₁ᵀ q(xₜ|xₜ₋₁) / ( p(x_T) ∏ₜ₌₁ᵀ pθ(xₜ₋₁|xₜ) ) ) ]
Using the product rule of logarithms, we can rewrite the right-hand side as a sum:

L_VLB = E_q[ −log p(x_T) + ∑ₜ₌₁ᵀ log( q(xₜ|xₜ₋₁) / pθ(xₜ₋₁|xₜ) ) ]
Taking the t = 1 term out of the summation gives us the following:

L_VLB = E_q[ −log p(x_T) + ∑ₜ₌₂ᵀ log( q(xₜ|xₜ₋₁) / pθ(xₜ₋₁|xₜ) ) + log( q(x₁|x₀) / pθ(x₀|x₁) ) ]
Rewriting q(xₜ|xₜ₋₁) using Bayes’ theorem and conditioning on the input image at t = 0:

q(xₜ|xₜ₋₁) = q(xₜ₋₁|xₜ, x₀) · q(xₜ|x₀) / q(xₜ₋₁|x₀)
Substituting:

L_VLB = E_q[ −log p(x_T) + ∑ₜ₌₂ᵀ log( q(xₜ₋₁|xₜ, x₀) / pθ(xₜ₋₁|xₜ) · q(xₜ|x₀) / q(xₜ₋₁|x₀) ) + log( q(x₁|x₀) / pθ(x₀|x₁) ) ]
Using the product rule again to split the summation in two:

L_VLB = E_q[ −log p(x_T) + ∑ₜ₌₂ᵀ log( q(xₜ₋₁|xₜ, x₀) / pθ(xₜ₋₁|xₜ) ) + ∑ₜ₌₂ᵀ log( q(xₜ|x₀) / q(xₜ₋₁|x₀) ) + log( q(x₁|x₀) / pθ(x₀|x₁) ) ]
The second summation can be simplified even further: it telescopes. Take T to be any number, write the terms out, and you’ll see that most of them cancel, leaving the following:

∑ₜ₌₂ᵀ log( q(xₜ|x₀) / q(xₜ₋₁|x₀) ) = log( q(x_T|x₀) / q(x₁|x₀) )
Substituting:

L_VLB = E_q[ −log p(x_T) + ∑ₜ₌₂ᵀ log( q(xₜ₋₁|xₜ, x₀) / pθ(xₜ₋₁|xₜ) ) + log( q(x_T|x₀) / q(x₁|x₀) ) + log( q(x₁|x₀) / pθ(x₀|x₁) ) ]
Using the quotient rule of logarithms, we can expand the last two terms:

log q(x_T|x₀) − log q(x₁|x₀) + log q(x₁|x₀) − log pθ(x₀|x₁)
You can see that the −log q(x₁|x₀) and +log q(x₁|x₀) terms cancel out. Tidying up our formula with another use of the quotient rule (combining the remaining log q(x_T|x₀) with the −log p(x_T) term):

L_VLB = E_q[ log( q(x_T|x₀) / p(x_T) ) + ∑ₜ₌₂ᵀ log( q(xₜ₋₁|xₜ, x₀) / pθ(xₜ₋₁|xₜ) ) − log pθ(x₀|x₁) ]
We can now write the log terms as KL divergences:

L_VLB = D_KL( q(x_T|x₀) ‖ p(x_T) ) + ∑ₜ₌₂ᵀ E_q[ D_KL( q(xₜ₋₁|xₜ, x₀) ‖ pθ(xₜ₋₁|xₜ) ) ] − E_q[ log pθ(x₀|x₁) ]
The authors of the DDPM paper ignored the first term, since it has no learnable parameters (x_T is essentially pure Gaussian noise) and is therefore constant during training. As mentioned above, pθ(xₜ₋₁|xₜ) is parametrized as a neural network that predicts the mean:

pθ(xₜ₋₁|xₜ) = N(xₜ₋₁; μθ(xₜ, t), σₜ²I)

where the variance σₜ²I is kept fixed in the DDPM paper.
q(xₜ₋₁|xₜ, x₀) has a closed-form solution, as mentioned before. We can write it as:

q(xₜ₋₁|xₜ, x₀) = N(xₜ₋₁; μ̃ₜ(xₜ, x₀), β̃ₜI)

where

μ̃ₜ(xₜ, x₀) = ( √ᾱₜ₋₁·βₜ / (1−ᾱₜ) )·x₀ + ( √αₜ·(1−ᾱₜ₋₁) / (1−ᾱₜ) )·xₜ and β̃ₜ = ( (1−ᾱₜ₋₁) / (1−ᾱₜ) )·βₜ
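As a sketch, those closed-form posterior parameters can be computed directly from the schedule; `betas` and `alpha_bar` are assumed to be the tensors from the forward-process sketch above:

```python
import torch

def q_posterior(x0, xt, t: int, betas, alpha_bar):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0).
    t indexes the schedule tensors (0-based); alpha_bar_prev is 1 at t == 0."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    alpha_t = 1.0 - betas[t]
    mean = (ab_prev.sqrt() * betas[t] / (1 - ab_t)) * x0 \
         + (alpha_t.sqrt() * (1 - ab_prev) / (1 - ab_t)) * xt
    var = (1 - ab_prev) / (1 - ab_t) * betas[t]     # beta_tilde_t
    return mean, var
```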
The authors went with a simple mean-squared error between the actual μ and the predicted μ. Using a derivation that is beyond the scope of this blog post to prove (substituting x₀ = (1/√ᾱₜ)·(xₜ − √(1−ᾱₜ)·ε) into μ̃ₜ), they arrived at the following:

μ̃ₜ = (1/√αₜ)·( xₜ − ( βₜ / √(1−ᾱₜ) )·ε )
Using the definition above, we can simplify the mean-squared error between the two means into a mean-squared error between the actual noise and the predicted noise:

E_q[ ( βₜ² / (2σₜ²·αₜ·(1−ᾱₜ)) )·‖ε − εθ(xₜ, t)‖² ]
This is the term we take our gradient descent step on! All those simplifications, and we reached the following conclusion: just predict the noise. The authors found that dropping the weighting term works even better in practice, so the final objective function takes the following form:

L_simple = E_{t, x₀, ε}[ ‖ε − εθ( √ᾱₜ·x₀ + √(1−ᾱₜ)·ε, t )‖² ]
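In code, a training step under this objective could look roughly like the following; `model` is a placeholder for any noise-prediction network (such as a U-Net) that takes the noisy image and the time step, not the authors’ exact implementation:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """One training step of L_simple: MSE between the true and the predicted noise."""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # random time step per sample
    eps = torch.randn_like(x0)                                  # the noise we try to predict
    ab = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                 # noise x0 in one closed-form step
    return F.mse_loss(model(xt, t), eps)                        # ||eps - eps_theta(x_t, t)||^2
```

And to generate an image at inference time, the DDPM paper’s Algorithm 2 runs the chain backwards, repeatedly removing the predicted noise. Here is a rough sketch with the same placeholder `model`, using the paper’s σₜ² = βₜ choice:

```python
import torch

@torch.no_grad()
def p_sample_loop(model, shape, betas, alphas, alpha_bar):
    """DDPM Algorithm 2: start from pure noise and denoise step by step."""
    T = betas.shape[0]
    x = torch.randn(shape)                                      # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                 # predicted noise
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * z                          # sigma_t^2 = beta_t choice
    return x
```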
Congrats! You made it through the reverse process.
What is Stable Diffusion?
Stable Diffusion is an open-source alternative to OpenAI’s DALL·E 2. Since Stable Diffusion is a Latent Diffusion Model (LDM), I’ll try to give a high-level explanation of LDMs. Remember how the reverse diffusion process uses a neural network to gradually reduce the noise? Stable Diffusion uses a U-Net: a convolution-based neural network that downsamples an image to a lower dimension and then reconstructs it during upsampling. Skip connections between the downsampling and upsampling layers allow for better gradient flow. The prompt is injected into the model through text embeddings generated by a language model; attention layers in the U-Net let the model attend to the text tokens through cross-attention.
LDMs, as the name suggests, don’t work on raw pixels. Instead, the image is encoded into a smaller latent space through an encoder, and decoded back into pixel space through a decoder. This lets the diffusion process run in the small latent space and complete all of the denoising there. You can think of it as an autoencoder wrapped around the diffusion process. This is why it’s called latent diffusion: we’re implementing the diffusion process not in pixels but in the latent space. The short sketch below summarizes the whole pipeline.
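At a very high level, the pipeline looks like this; every name below (`denoise`, `decoder`, `prompt_emb`) is a placeholder standing in for the trained U-Net, the VAE decoder, and the text embeddings, and the 1×4×64×64 latent shape is an assumption matching Stable Diffusion’s default latent resolution, not a real API:

```python
import torch

@torch.no_grad()
def ldm_generate(prompt_emb, denoise, decoder, T: int, latent_shape=(1, 4, 64, 64)):
    """Conceptual LDM sampling: run the reverse diffusion in latent space, decode once."""
    z = torch.randn(latent_shape)              # start from noise in the *latent* space
    for t in reversed(range(T)):
        z = denoise(z, t, prompt_emb)          # one reverse step, cross-attending to the text
    return decoder(z)                          # VAE decoder maps the latent back to pixels
```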
Summary
- Diffusion models work by adding noise to an image iteratively, then training a neural network to learn the noise and restore the image.
- U-Nets are the most widely used neural network for the reverse process.
- Skip connections and attention layers are added to the U-Net for better performance.
- LDMs work by encoding the image into a smaller latent space and implementing the diffusion process in that space; the generated images are then restored through a decoder.
References
Papers:
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics
- Denoising Diffusion Probabilistic Models
- Improved Denoising Diffusion Probabilistic Models
Videos:
- Diffusion Models | Paper Explanation | Math Explained
- Diffusion models from scratch in PyTorch
- DDPM — Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper Explained)
- How does Stable Diffusion work? — Latent Diffusion Models EXPLAINED