Kullback–Leibler (KL) Divergence and Cross-Entropy
Explaining the derivation and giving an intuitive sense of what these quantities represent
The Kullback–Leibler (KL) divergence is a measure of the difference between two probability distributions. You’ve probably seen this concept many times over in the field of machine learning. Most notably, it is heavily relied upon when deriving the evidence lower bound (ELBO), which appears in variational autoencoders (VAEs) and in diffusion models.
Derivation
To understand this concept intuitively, I’ll first construct a simple experiment that involves independent but identical stages, i.e., flipping a coin n times. Since each flip has 2 outcomes, our investigation will follow a sequence of independent Bernoulli trials; our case will be an n-long sequence of heads and tails. Using the multiplication rule, we can easily obtain that the probability of seeing k heads and n−k tails (with heads probability p and tails probability q = 1−p) in a given order is pᵏqⁿ⁻ᵏ.

Let us now assume that our task is to compare the distributions of two coins, one fair, the other biased. We will denote the first distribution as p and the second as q. If the two distributions are similar, then sampling n times from each would probably yield similar-looking sequences. A more quantitative definition would be to measure the difference between these two distributions. Formally: log(p(x)) − log(q(x)); if the result of this subtraction is close to zero, then the distributions should be similar.

Hey! What’s with the logs? Since these probabilities will mostly be small numbers, and multiplying them together will result in even smaller numbers, we simply take the log (mathematical convenience at play here). Now, using the log quotient rule, we can rewrite this as:
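With θ and φ standing in for the parameters of the two distributions (the symbols themselves are my choice of notation):

\log p_\theta(x) - \log q_\phi(x) = \log \frac{p_\theta(x)}{q_\phi(x)}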
As you can see, I’ve parametrized p and q to indicate that they could come from different types of distributions. For example, one could come from a Gaussian distribution, the other from a binomial distribution. The formula above is known as the log-likelihood ratio. Since we are interested in the expected value of this log-likelihood ratio, we should convert it into a weighted average. Here is a quick recap of what the expected value is (please note that I’ll only be working with discrete random variables):
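\mathbb{E}_{p_\theta}[h(x)] = \sum_{i} p_\theta(x_i)\, h(x_i)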
h(xᵢ) represents a function of the random variable (the state) and pθ(xᵢ) the weight. Using the expected value formula, we can easily convert our log-likelihood ratio into its expected value:
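\mathbb{E}_{p_\theta}\!\left[\log \frac{p_\theta(x)}{q_\phi(x)}\right] = \sum_{i} p_\theta(x_i)\, \log \frac{p_\theta(x_i)}{q_\phi(x_i)}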
Putting things into place, we can rewrite the KL divergence as
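D_{KL}(p_\theta \,\|\, q_\phi) = \mathbb{E}_{p_\theta}\!\left[\log \frac{p_\theta(x)}{q_\phi(x)}\right] = \sum_{i} p_\theta(x_i)\, \log \frac{p_\theta(x_i)}{q_\phi(x_i)}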
Simply put, we took the difference between two distributions, converted it to a log ratio, applied the expected value definition, and voilà! But this derivation seemed a bit too abstract for my taste, so let’s look at a simpler and, in my opinion, more intuitive one.
Remember the sequence of coin tosses? Writing out the probabilities for two coins results in the following:
Coin 1: p₁ᵏp₂ⁿ⁻ᵏ
Coin 2: q₁ᵏq₂ⁿ⁻ᵏ

Let’s take the ratio between these two probabilities. If Coin 2 follows the same distribution as Coin 1, the ratio between them should be close to 1:
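\frac{p_1^{k}\, p_2^{\,n-k}}{q_1^{k}\, q_2^{\,n-k}}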
Taking the log of this expression, applying the log rules, and normalizing by the number of tosses n, we get the following:
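\frac{1}{n} \log \frac{p_1^{k}\, p_2^{\,n-k}}{q_1^{k}\, q_2^{\,n-k}} = \frac{k}{n} \log \frac{p_1}{q_1} + \frac{n-k}{n} \log \frac{p_2}{q_2}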
As the number of observations n goes to infinity, we expect k/n and (n−k)/n to approach p₁ and p₂, respectively. Rewriting the terms, we get the following:
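p_1 \log \frac{p_1}{q_1} + p_2 \log \frac{p_2}{q_2}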
The KL divergence is the general form of this normalized log ratio when there are multiple classes (a set of values). So all we did here was simplify the log ratio between observations from two different distributions, and as you can see, the simplified form resembles the KL divergence formula. Generalizing the above formula:
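D_{KL}(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i}

To sanity-check this numerically, here is a small Python sketch (the coin probabilities and variable names are my own choices): it computes the KL divergence between a fair and a biased coin by hand and compares the result with scipy.stats.entropy, which returns the same quantity when given two distributions.

```python
import numpy as np
from scipy.stats import entropy

# Heads/tails probabilities for a fair coin and a biased coin.
p = np.array([0.5, 0.5])   # distribution of coin 1 (fair)
q = np.array([0.7, 0.3])   # distribution of coin 2 (biased)

# KL divergence by hand: sum_i p_i * log(p_i / q_i)
kl_manual = np.sum(p * np.log(p / q))

# scipy computes the same quantity when given two distributions.
kl_scipy = entropy(p, q)

print(kl_manual, kl_scipy)  # both ≈ 0.087: small, but clearly non-zero
```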
Cross Entropy Loss
Cross-entropy loss is a key measure in many machine learning models and can be used to evaluate the quality of the predictions those models make. It is especially useful for tasks that involve multi-class classification, such as image recognition and natural language processing. By measuring the difference between what the model predicted and what it should predict according to the labels, cross-entropy loss provides a reliable way to determine how well a model has learned, which helps identify where changes need to be made to improve performance. It takes the following general form:
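H(q, p) = -\sum_{x} q(x) \log p(x)

where q denotes the true distribution and p the predicted one (matching the notation used below).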
Using what we have learned above, let’s design our machine-learning model to output a probability distribution. For an input image xᵢ, the predicted class distribution is p(y|xᵢ;θ), and the true class distribution is q(y|xᵢ). We can use the KL divergence to look at the distance between these two probability distributions:
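D_{KL}\big(q(y \mid x_i) \,\|\, p(y \mid x_i;\theta)\big)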
Computer science involves lots of logs, so again, let’s apply them!
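D_{KL}\big(q(y \mid x_i) \,\|\, p(y \mid x_i;\theta)\big) = \sum_{y} q(y \mid x_i) \log \frac{q(y \mid x_i)}{p(y \mid x_i;\theta)} = \sum_{y} q(y \mid x_i) \big[\log q(y \mid x_i) - \log p(y \mid x_i;\theta)\big]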
Separating the two terms gives us the following:
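\sum_{y} q(y \mid x_i) \log q(y \mid x_i) \;-\; \sum_{y} q(y \mid x_i) \log p(y \mid x_i;\theta)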
Since we are minimizing this expression with respect to the parameters θ, we can discard the first term: it does not depend on θ at all (it is simply the negative entropy of the true distribution). What remains now takes the familiar form:
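-\sum_{y} q(y \mid x_i) \log p(y \mid x_i;\theta)

which is exactly the cross-entropy loss for example xᵢ. As a quick numerical check of this decomposition (the class distributions below are made up for illustration), here is a Python sketch showing that the cross-entropy computed directly equals the entropy of the true distribution plus the KL divergence; since the entropy term does not depend on θ, minimizing the cross-entropy is equivalent to minimizing the KL divergence.

```python
import numpy as np
from scipy.stats import entropy

# True class distribution (a soft label, so its entropy is non-zero) and a
# model's predicted distribution over 3 classes; the numbers are made up.
q = np.array([0.8, 0.1, 0.1])   # true distribution q(y | x_i)
p = np.array([0.6, 0.3, 0.1])   # predicted distribution p(y | x_i; theta)

cross_entropy = -np.sum(q * np.log(p))      # -sum_y q log p
decomposition = entropy(q) + entropy(q, p)  # H(q) + KL(q || p)

print(cross_entropy, decomposition)  # both ≈ 0.759
```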
Summary
- Cross-entropy is closely related to the KL divergence and can be derived from it.
- KL divergence is a measure of how close two probability distributions are.
- KL divergence is the expected value of the log-likelihood ratio.