Optimal transport has become a hotly investigated topic in machine learning. What was previously a branch of mathematics with certain applications in economics is now finding many use cases in machine learning, such as image translation. A key concept in optimal transport theory is the Wasserstein distance, a metric used to quantify the distance between differing datasets. In this post we will introduce the Wasserstein distance and how it differs from more traditional measures. In the following post, we will introduce an efficient means of computing Wasserstein distances called the Sinkhorn metric.

Comparing Datasets

In many applications of machine learning, one might find it useful to compare two different datasets. For example, one might want to compare the distribution of training data to that of testing data. Or one might want to compare different training datasets for the purposes of transfer learning. Whatever the case, it is important to have a metric that determines how “different” (in terms of distribution) the datasets are from each other, assuming we don’t know their distributions beforehand.

An Information-theoretic Perspective

Traditionally, the answer to this problem has come from information theory. Given two random variables \(X\) and \(Y\), we may treat \(X\) as data that has already been observed (i.e. training data) and \(Y\) as a model that aims to approximate \(X\) (i.e. testing data).

Shannon Entropy

From an information-theoretic perspective, a (discrete) random variable \(X\) can be thought of as containing information, the amount depending on what its probability distribution looks like. This information can be encoded in bits of 0s and 1s.

Now we want to ask, is there a quantifiable difference between a fair coin (50/50 heads or tails) and a rigged coin? It turns out that the amount of information needed to encode the two types of coins is very different.

Suppose for the moment that you know the coin is fair, and someone flips it 2 times without your knowledge. You want to know how the flips landed. Out of these 2 flips we have a total of \(2^2 = 4\) possible ordered sequences of heads and tails. To find out exactly which outcome occurred, you must ask the tosser a series of yes/no questions about which flips were heads and which were tails. How many questions would you need to ask on average, in order to figure out the configuration of the coin tosses?

The first thing that may come into your mind is to ask sequentially:

  • Is the first coin heads?
  • Is the second coin heads?

Then we would need to ask 2 questions for any particular experiment in order to know exactly how the 2 coin flips landed. In terms of information, we can encode this as two bits: the leftmost bit encodes the answer to the first question and the rightmost bit the answer to the second. For example, the string 10 encodes the answers yes to the first question and no to the second. Since this uniquely determines the configuration, the string 10 corresponds to heads-tails.

Since the coin is fair, every one of the \(2^2\) configurations is equally likely, and since we needed to ask 2 questions for each, the average number of yes/no questions required is 2. The average number of questions required per coin flip is therefore 1. That’s how much information we needed to figure out the outcomes of the coin tosses. So for a fair coin, we may say that it requires 1 bit of information per flip to encode its distribution.

Now what if you knew that the coin was unfair? Let’s take a very extreme example, the coin is magnetized on one side and there’s a hidden magnet underneath the table so that it always comes up heads, and you know this. How many questions (on average) would we need to ask in this case?

The answer is 0, because you already know that the coin will come up heads every time, so no questions are required. Therefore a coin which always comes up heads requires 0 bits of information to encode.

Now what about something in between? Suppose that \(P(Heads) = \frac{3}{4}\) and \(P(Tails) = \frac{1}{4}\).

It may still seem like we need \(n\) questions for \(n\) tosses to figure out the exact configuration. But it turns out that in this biased case, we can do better.

Consider the following series of questions we ask for a 2-toss experiment:

  1. Is there at least one tails? -> If answer is no, then we must have head-head.
  2. If 1. is yes, is the first coin tails? -> If answer is no, then we have head-tail (since we know there is at least one tail and it’s not the first coin).
  3. If 2. is yes, is the second coin heads? -> If answer is yes, then we have tail-head. If answer is no, then we must have tail-tail.

In terms of encoding, we once again encode each answer as a bit. For example, the string 110 represents yes to questions 1 and 2 and no to question 3, which corresponds to the configuration tail-tail.

So as we can see, here we use a strategy that is not uniform: some configurations lead us to ask only one question (head-head), whereas others require 3 questions (tail-tail). Since we are interested in the average number of questions, this allows us to beat the naive strategy of asking 2 sequential questions (which costs 2 questions on average as well).

Why is this? Let’s look at the probabilities of each configuration and the number of corresponding questions:

  • Head-head: probability = \((\frac{3}{4})^2 = \frac{9}{16}\), questions = 1
  • Head-tail: probability = \(\frac{3}{4} \cdot \frac{1}{4} = \frac{3}{16}\), questions = 2
  • Tail-head: probability = \(\frac{1}{4} \cdot \frac{3}{4} = \frac{3}{16}\), questions = 3
  • Tail-tail: probability = \(\frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}\), questions = 3

Then the expected (or average) number of questions is given by

\[\frac{9}{16} \cdot 1 + \frac{3}{16} \cdot 2 + \frac{3}{16} \cdot 3 + \frac{1}{16} \cdot 3 = 1.6875\]

and since this covers 2 tosses, the average number of questions per toss is 0.84375, lower than the 1 bit (or question) per toss required for a fair coin. In terms of information, we needed an average of about 0.84 bits per toss to encode this probability distribution.
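If you want to double-check this arithmetic, here is a tiny Python sketch (using only the standard library; the probabilities and question counts simply transcribe the strategy above) that enumerates the four configurations:

```python
from itertools import product

# Biased coin: heads with probability 3/4, tails with probability 1/4.
prob = {"H": 3 / 4, "T": 1 / 4}

# Number of yes/no questions the adaptive strategy asks for each 2-toss configuration.
questions = {("H", "H"): 1, ("H", "T"): 2, ("T", "H"): 3, ("T", "T"): 3}

avg = sum(prob[a] * prob[b] * questions[(a, b)] for a, b in product("HT", repeat=2))
print(avg, avg / 2)  # 1.6875 questions per experiment, 0.84375 per toss
```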

Now it’s easy to see why this strategy won’t work for a fair coin: since each configuration has equal probability, we would be taking the average of \(\{1, 2, 3, 3\}\), which is greater than 2. So for the fair coin, asking about the value of each coin sequentially is actually the best we can do. This gives us a meaningful and quantitative way of distinguishing between coins of different biases: the fair coin needs more information to “figure out” or encode, and the more biased a coin is, the less information is needed to figure it out. So a fair coin contains more information content.

In the case of the biased coin, how can we be sure that we’ve asked the best set of questions (i.e. used the most efficient encoding)? Claude Shannon proved an achievable lower bound on the amount of information needed. Given a discrete random variable \(X\), its Shannon entropy is given by:

\[H(X) = - \displaystyle\sum\limits_x p(x) \log_2 p(x)\]

The base of the logarithm indicates the unit of information we’re dealing with. Usually base 2 is chosen because we quantify information in terms of bits. A quick calculation shows that the Shannon entropy of a fair coin is 1, whereas the Shannon entropy of the biased coin we looked at earlier is in fact around 0.81. This means that in our example, we didn’t use the best possible encoding.
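To verify these numbers, here is a minimal Python sketch (assuming NumPy is available; shannon_entropy is just our own helper name) that evaluates the formula above for the coins we’ve discussed:

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zero-probability outcomes are ignored)."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(np.sum(probs * np.log2(1.0 / probs)))

print(shannon_entropy([0.5, 0.5]))    # 1.0    bit  (fair coin)
print(shannon_entropy([0.75, 0.25]))  # ~0.811 bits (biased coin)
print(shannon_entropy([1.0]))         # 0.0    bits (coin that always lands heads)
```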

Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence is a relative measure of how much information is lost when trying to approximate the known distribution \(P\) with a new model \(Q\) (defined on the same sample space). More precisely, it measures the number of extra bits required to code samples from \(P\) using a code optimized for \(Q\). Given densities \(p(x), q(x)\) for \(P, Q\) respectively, its formula is as follows:

\[KL(P||Q) = \displaystyle\sum\limits_x p(x) \log_2\left(\frac{p(x)}{q(x)}\right)\]

Intuitively, the log factor measures the difference in information between the two distributions, and therefore the sum becomes the expected difference in information between \(P\) and \(Q\).

To take the example of our two coins from before, let \(P\) denote the random variable of the fair coin and \(Q\) the random variable of the coin with \(P(Heads) = \frac{3}{4}\) and \(P(Tails) = \frac{1}{4}\). The KL divergence from \(Q\) to \(P\) is given by

\[KL(P||Q) = \frac{1}{2} \log_2\left(\frac{1/2}{3/4}\right) + \frac{1}{2} \log_2\left(\frac{1/2}{1/4}\right) \approx 0.208\]

We interpret this as saying: if we have an optimized set of questions which tells us the configurations of the biased coin, we would need an extra 0.208 questions (on average) per coin flip in order to figure out the configurations of the fair coin. This makes sense, since the fair coin holds more information, as we’ve already seen.

Caveats

Now this seems like a perfect way to measure differences between distributions. For example, one might have a training set and wonder how much extra information is needed to classify the testing set, knowing an optimal function to classify the training set. However, there’s one huge caveat: KL divergence is not a distance function!

To see this, note that for any reasonable definition of distance, going from a point A to a point B should yield the same distance as going back from B to A. However, KL divergence doesn’t have this property! Let’s take the two coins from the previous section and compute the KL divergence from \(P\) to \(Q\):

\[KL(Q||P) = \frac{3}{4} \log_2\left(\frac{3/4}{1/2}\right) + \frac{1}{4} \log_2\left(\frac{1/4}{1/2}\right) \approx 0.189 \neq KL(P||Q)\]

So given our sequential strategy of asking about one coin at a time, which is optimal for the fair coin, if we applied the same strategy to the biased coin, it would require roughly 0.189 additional questions per toss on average! This happens even though the biased coin contains less information, because of course, the optimal encoding for one distribution is generally not the optimal encoding for the other, and vice-versa.
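To make the asymmetry concrete, here is a minimal Python sketch (assuming NumPy; kl_divergence is our own helper, not a library function) that computes both directions for the two coins:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in bits for discrete distributions on the same outcomes (assumes q > 0 wherever p > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

fair, biased = [0.5, 0.5], [0.75, 0.25]

print(kl_divergence(fair, biased))  # ~0.208: coding the fair coin with the biased coin's optimal code
print(kl_divergence(biased, fair))  # ~0.189: coding the biased coin with the fair coin's optimal code
```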

Another property that the KL divergence does not satisfy is the triangle inequality. Formally, it takes the form

\[d(x, z) \le d(x, y) + d(y, z)\]

for points \(x, y, z\) in the same space and a distance function \(d\). Intuitively, it says that going directly from a point A to another point B is never longer than making a detour through some intermediate point in between.

To show an example of this failure, consider again the fair and biased coins \(P, Q\) from before, together with an additional biased coin \(R\) with \(P(Heads) = \frac{9}{10}\) and \(P(Tails) = \frac{1}{10}\). We leave you to show that \(KL(P||R) > KL(P||Q) + KL(Q||R)\).

This implies that switching directly to a new dataset can require more information (e.g. extra training) than first going to some third dataset and then switching to the new one, which seems paradoxical.
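If you would rather check the inequality numerically than by hand, here is a short self-contained sketch (again only NumPy; the helper kl is our own):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))  # all entries are positive here

P, Q, R = [0.5, 0.5], [0.75, 0.25], [0.9, 0.1]  # fair coin, 3/4-biased coin, 9/10-biased coin

direct = kl(P, R)             # going straight from P to R
via_Q  = kl(P, Q) + kl(Q, R)  # going through the intermediate coin Q
print(round(direct, 3), round(via_Q, 3), direct > via_Q)  # 0.737 0.341 True
```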

Costs of Moving

Sometimes people will use the Jensen-Shannon divergence as a symmetrized version of the KL divergence. Unfortunately this only takes care of the symmetry property, not the triangle inequality (in general).

More recently, machine learning researchers have started to think about this problem of comparing distributions using a different perspective: rather than think about some notion of intrinsic difference between two distributions, instead think about how much it would cost (this could be computing power, physical energy, monetary cost, etc.) to move the data from one distribution to the other.

In a way this is even more useful than KL divergence. Even though KL divergence tells us how much information it takes to move from one distribution to another, it’s agnostic on how much each bit of information costs. In real-world applications, we care very much about costs, and some forms of information may be cheaper than others. As we will see shortly, this interpretation involving costs of moving also gives us a bona fide distance function to work with.

Wasserstein Distance

The original motivation behind Wasserstein distance, and indeed the field of optimal transport, comes from transporting goods and resources over distances. For example, a copper mine is located some distance away from the factories, and manufacturers often needed to calculate the most cost-efficient way to transport raw copper to the factories for processing. This is often a hard problem because there are many paths to choose from (different roads have different costs) and also because there are multiple mines and factories (which mine should transport to which factory).

By this analogy, given distributions \(P\) and \(Q\), we can think of samples in \(P\) as the locations of the mines and samples in \(Q\) as the locations of the factories. Then what we want to do is calculate the most cost-efficient way to move the copper (samples of \(P\)) to the factories (samples of \(Q\)). For now, let’s declare the cost of moving from point \(x\) to point \(y\) to be the Euclidean distance between them: \(||x - y||\).

Now let us set up the definition of the Wasserstein distance. Suppose we have two random variables \(X\) and \(Y\) taking values in a probability space \(\mathcal{X}\), with marginal distributions \(\mu\) and \(\nu\) respectively. Moreover, suppose we have a distance function \(d(x, y)\) on \(\mathcal{X}\).

Then we define the Wasserstein-p distance as:

\[W_p(\mu, \nu) = \left( \inf\limits_{\pi \in \Pi(\mu, \nu)} \displaystyle\int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\pi(x, y) \right)^{1/p}\]

where the infimum is taken over all possible joint distributions \(\pi\) of \(X, Y\) with marginals \(\mu\) and \(\nu\).
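As a quick sanity check on the definition, consider two point masses \(\mu = \delta_{x_0}\) and \(\nu = \delta_{y_0}\). The only joint distribution with these marginals is \(\pi = \delta_{(x_0, y_0)}\), so the infimum is over a single coupling and

\[W_p(\mu, \nu) = \left( \displaystyle\int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\delta_{(x_0, y_0)}(x, y) \right)^{1/p} = d(x_0, y_0).\]

In other words, moving all of the mass from \(x_0\) to \(y_0\) costs exactly the distance between the two points, which is what we would hope for.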

The above is the general definition for continuous random variables, but since we are mostly concerned with doing computations and optimizations, the discrete case is what matters to us. In the discrete setting, the probability space is finite with size \(n = |\mathcal{X}|\). The distance function \(d(x, y)\) becomes a distance matrix \(D\) of size \(n \times n\), and the joint distribution likewise becomes an \(n \times n\) matrix \(P\). The marginal distributions become the vectors \(\mu = P \cdot 1_n\) and \(\nu = P^T 1_n\), the row and column sums of the joint distribution matrix (here \(1_n\) denotes the vector of \(n\) ones).

So the discrete Wasserstein-p distance is given by:

\[W_p(\mu, \nu) = \left( \min\limits_{P \in M(n, n)^+,\; \mu = P \cdot 1_n,\; \nu = P^T 1_n} \langle D^{(p)}, P \rangle_F \right)^{1/p}\]

where \(\langle \cdot, \cdot \rangle_F\) denotes the Frobenius inner product for matrices, \(M(n, n)^+\) the set of \(n \times n\) matrices with nonnegative entries, and \(D^{(p)}\) the elementwise \(p\)-th power of \(D\).
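To make the discrete definition concrete, here is a minimal Python sketch (assuming NumPy and SciPy; wasserstein_p and the toy example are our own, and this brute-force linear program only scales to small problems — the next post covers a more efficient approach) that solves the minimization above with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(mu, nu, D, p=1):
    """Discrete Wasserstein-p distance between two distributions on the same n support points.

    mu, nu : length-n marginal distributions (nonnegative, summing to 1)
    D      : n x n matrix of pairwise distances between the support points
    """
    n = len(mu)
    cost = (D ** p).flatten()  # elementwise p-th power of D, flattened row-major

    # Equality constraints on the flattened coupling matrix:
    # row sums must equal mu, column sums must equal nu.
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1  # row-sum constraint for row i
        A_eq[n + i, i::n] = 1           # column-sum constraint for column i
    b_eq = np.concatenate([mu, nu])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1 / p)

# Two distributions on the points {0, 1, 2} of the real line.
mu = np.array([0.5, 0.5, 0.0])
nu = np.array([0.0, 0.5, 0.5])
points = np.array([0.0, 1.0, 2.0])
D = np.abs(points[:, None] - points[None, :])  # |x - y| cost matrix

print(wasserstein_p(mu, nu, D, p=1))  # 1.0: all the mass shifts one unit to the right
```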

It’s easy to see that under both definitions the Wasserstein-p distance is symmetric. The proof of the triangle inequality is more involved; see [1].

Conclusion

In this post we have surveyed the long-standing problem of measuring differences between distributions. We presented an information-theoretic perspective, defined the KL divergence, and then introduced the Wasserstein-p distance. In the next post we will cover a practical way of computing Wasserstein distances.


  1. Cedric Villani: Optimal Transport: Old and New (2008)