Recall that in the last post, we gave the discrete version of the Wasserstein-p distance:

\[W_p(\mu, \nu) = \left( \min\limits_{P(x, y) \in M(d, d)^+, \mu = P \cdot 1_d, \nu = P^T 1_d} \langle D(x, y)^{(p)}, P(x, y)\rangle_F \right)^{1/p}\]
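
For \(p = 1\) this is just a linear program over the coupling matrix \(P\). As a concrete illustration (a toy sketch of my own, not from the paper), here is how the discrete problem can be solved with scipy.optimize.linprog; the marginals and cost matrix are made-up data:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d = 4

# Toy marginals and a Euclidean cost matrix (illustrative data only)
mu = rng.random(d); mu /= mu.sum()
nu = rng.random(d); nu /= nu.sum()
points = rng.random((d, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Decision variable: P flattened row by row; the objective is <D, P>_F
c = D.ravel()

# Marginal constraints: row sums of P equal mu, column sums equal nu
A_eq = np.zeros((2 * d, d * d))
for i in range(d):
    A_eq[i, i * d:(i + 1) * d] = 1.0  # row i of P sums to mu[i]
    A_eq[d + i, i::d] = 1.0           # column i of P sums to nu[i]
b_eq = np.concatenate([mu, nu])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)  # W_1(mu, nu); for general p, replace D with D**p and take the 1/p-th root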

For convenience, we denote the set of all joint distribution matrices with marginals \(\mu, \nu\) as

\(U(\mu, \nu) = \{P(x, y) \in M(d, d)^+ : \mu = P \cdot 1_d,\ \nu = P^T 1_d\}\).

In fact, this set is a convex polytope, and we call it the Transport Polytope.
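
As a quick sanity check on this definition (a toy example of mine, not from the paper), the independent coupling \(\mu\nu^T\) always belongs to \(U(\mu, \nu)\):

import numpy as np

rng = np.random.default_rng(0)
d = 5
mu = rng.random(d); mu /= mu.sum()
nu = rng.random(d); nu /= nu.sum()

P = np.outer(mu, nu)  # the independent coupling mu nu^T

# Nonnegative entries, row sums equal mu, column sums equal nu
print((P >= 0).all(), np.allclose(P.sum(axis=1), mu), np.allclose(P.sum(axis=0), nu))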

Information Theory Revisited

You might also recall our discussion of Shannon entropy and KL divergence from the last post. I might have given the impression that these concepts have been displaced by the exciting new world of optimal transport, but as it turns out, they are very important for computing the Sinkhorn metric!

As we can readily gather from the problem setting, there can be a great many joint distributions with marginals \(\mu, \nu\), even in the discrete case. However, we can constrain which joint distributions are possible by looking at their Shannon entropy.

Theorem [1]: \(H(P) \le H(\mu) + H(\nu)\), and the inequality is sharp.

In other words, the Shannon entropy of a joint distribution is always bounded above by the sum of the Shannon entropies of its marginal distributions. Moreover, this bound is sharp, since the independent joint distribution matrix \(\mu\nu^T\) attains it.

With this constraint in mind, we define a subset of the transport polytope:

\[U_{\alpha}(\mu, \nu) = \{ P \in U(\mu, \nu) | KL(P || \mu\nu^T) \le \alpha\}\]

Intuitively, this subset contains all joint distributions whose Shannon entropy is at most \(\alpha\) below the maximum possible entropy \(H(\mu) + H(\nu)\) (which is attained by the independent joint distribution \(\mu\nu^T\)): for any \(P \in U(\mu, \nu)\) we have \(KL(P || \mu\nu^T) = H(\mu) + H(\nu) - H(P)\). In other words, \(U_{\alpha}(\mu, \nu)\) consists of the joint distributions with sufficient entropy relative to \(H(\mu)\) and \(H(\nu)\).
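
As a small numerical illustration of both Theorem [1] and the \(U_{\alpha}\) condition (toy data of my own, not from the paper):

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
d = 5
P = rng.random((d, d)); P /= P.sum()   # a random joint distribution
mu, nu = P.sum(axis=1), P.sum(axis=0)  # its marginals, so P lies in U(mu, nu)

kl = (P * np.log(P / np.outer(mu, nu))).sum()  # KL(P || mu nu^T)

print(entropy(P) <= entropy(mu) + entropy(nu))                 # Theorem [1]
print(np.isclose(kl, entropy(mu) + entropy(nu) - entropy(P)))  # the identity used above
print("P lies in U_alpha for alpha = 0.5:", kl <= 0.5)         # membership in U_alpha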

Sinkhorn Metric

Now we are ready to define the Sinkhorn Metric:

\[d_{D, \alpha}(\mu, \nu) = \min\limits_{P(x, y) \in U_{\alpha}(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F\]

We see that this distance is similar to the Wasserstein-1 distance, except that we minimize only over the joint distributions with sufficient entropy with respect to \(H(\mu)\) and \(H(\nu)\).

Why only consider this restricted subset of joint distributions? The author of [2] explains that the optimal joint distributions in the original transport problem often lie at the corners of the transport polytope and are thus statistical outliers. Imposing the entropic restriction smooths the space of admissible joint distributions, which allows for easier computation.

Special cases

In one extreme, let us take \(\alpha\) to infinity. As \(\alpha\) grows, the entropy constraint becomes weaker, so \(U_{\alpha}(\mu, \nu)\) grows until it converges to \(U(\mu, \nu)\) proper. Thus, for large enough \(\alpha\), \(U_{\alpha}(\mu, \nu)\) is a good approximation of the whole transport polytope.

In the other extreme, if we set \(\alpha = 0\), the constraint \(KL(P || \mu\nu^T) \le 0\) forces \(P = \mu\nu^T\), so the only admissible coupling is the independent one and the optimal transport problem can be solved in closed form for any distance matrix. In fact we have:

Theorem: \(d_{D, 0}(\mu, \nu) = \mu^T D \nu\).
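
A quick numerical check of this closed form (toy data of my own, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
d = 4
mu = rng.random(d); mu /= mu.sum()
nu = rng.random(d); nu /= nu.sum()
D = rng.random((d, d))

# With alpha = 0 the only admissible coupling is mu nu^T, so the cost is mu^T D nu
print(mu @ D @ nu)
print(np.sum(np.outer(mu, nu) * D))  # the same value, written as <D, mu nu^T>_F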

Beyond these special cases, the author proves that the Sinkhorn distance satisfies symmetry and the triangle inequality. The proof is similar to C. Villani’s proof in [3].

Duality

In many optimization problems, solving the problem amounts to solving its dual. Perhaps the most familiar example is the method of Lagrange multipliers. Given a constrained optimization problem

\[p^* = \min\limits_x f_0(x): f_i(x) \le c_i, i = 1, \cdots, m\]

the dual problem moves the constraints into the objective, so that the inner minimization is unconstrained:

\[d^* = \max\limits_{\lambda \ge 0 } \min\limits_x \mathcal{L}(x, \lambda)\]

where \(\mathcal{L}(x, \lambda) = f_0(x) + \displaystyle\sum\limits_{i=1}^m \lambda_i (f_i(x) - c_i)\) is called the Lagrangian and the parameters \(\lambda_i\) are called Lagrange multipliers.

It may seem at first glance that the optimal transport problem \(\min\limits_{P(x, y) \in U(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F\) has no constraints beyond the marginal conditions, but the Sinkhorn distance adds exactly the kind of inequality constraint we can dualize, namely \(KL(P || \mu\nu^T) \le \alpha\), or equivalently \(H(P) \ge H(\mu) + H(\nu) - \alpha\). Dualizing this constraint transforms the optimal transport problem into

\[\min\limits_{P(x, y) \in U(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F - \frac{1}{\lambda} H(P)\]

(The author uses a slightly different convention, placing the Lagrange multiplier in the denominator as \(\frac{1}{\lambda}\); this does not change the optimization problem.)
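
To see where the entropy term comes from, here is a brief sketch. Introducing a multiplier \(\frac{1}{\lambda} \ge 0\) for the KL constraint that defines \(U_{\alpha}\), and using the identity \(KL(P || \mu\nu^T) = H(\mu) + H(\nu) - H(P)\) for \(P \in U(\mu, \nu)\), the Lagrangian reads

\[\langle D, P\rangle_F + \frac{1}{\lambda}\left(KL(P || \mu\nu^T) - \alpha\right) = \langle D, P\rangle_F - \frac{1}{\lambda} H(P) + \frac{1}{\lambda}\left(H(\mu) + H(\nu) - \alpha\right)\]

The last term does not depend on \(P\), so minimizing over \(P \in U(\mu, \nu)\) gives exactly the problem above.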

Notice that each value of the entropy bound \(\alpha\) yields its own dual problem, and hence its own Lagrange multiplier \(\lambda\). By duality, for every \(\alpha\) there is a corresponding \(\lambda\) such that the minimizer \(P_{\lambda}\) of the penalized problem above also solves the constrained problem defining the Sinkhorn distance \(\min\limits_{P(x, y) \in U_{\alpha}(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F\).

It turns out that solving this dual, penalized problem is computationally much cheaper than solving the original linear program.

Algorithm

The algorithm starts by computing the matrix \(K = \exp(-\lambda D)\), the elementwise exponential of \(-\lambda D\). It then uses the Sinkhorn fixed-point iteration [4] to compute the dual joint distribution matrix \(P_{\lambda}\). Finally, once the iteration has converged to within some threshold, the resulting matrix is used to compute the Sinkhorn distance by taking its Frobenius inner product with \(D\).

Below is pseudocode of the algorithm (all dots before operators indicate elementwise operations):

function dual_sinkhorn(D, lambda, mu, nu){
    K = exp(-lambda * D)             # Elementwise exponential of -lambda * D
    K_tilde = diag(1 ./ mu) * K      # Each row of K divided by the corresponding entry of mu
    u = ones(size(mu))
    while u changes more than some threshold:
        u = 1 ./ (K_tilde * (nu ./ (K.transpose() * u)))    # Sinkhorn fixed-point iteration
    v = nu ./ (K.transpose() * u)
    d = sum(u .* ((K .* D) * v))     # Frobenius inner product of P_lambda = diag(u) K diag(v) with D
    return d
}
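
For readers who prefer runnable code, here is a minimal NumPy sketch of the same iteration (my own translation of the pseudocode above, with made-up toy inputs; it is not an official implementation):

import numpy as np

def dual_sinkhorn(D, lam, mu, nu, tol=1e-9, max_iter=10000):
    K = np.exp(-lam * D)          # elementwise exponential of -lam * D
    K_tilde = K / mu[:, None]     # diag(1 / mu) @ K
    u = np.ones_like(mu)
    for _ in range(max_iter):
        u_new = 1.0 / (K_tilde @ (nu / (K.T @ u)))  # Sinkhorn fixed-point iteration
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    v = nu / (K.T @ u)
    return np.sum(u * ((K * D) @ v))  # <P_lambda, D>_F with P_lambda = diag(u) K diag(v)

# Toy example (illustrative data only)
rng = np.random.default_rng(0)
d = 5
mu = rng.random(d); mu /= mu.sum()
nu = rng.random(d); nu /= nu.sum()
points = rng.random((d, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(dual_sinkhorn(D, lam=50.0, mu=mu, nu=nu))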

The author does in fact extend the algorithm to compute Sinkhorn distances against \(N\) target distributions simultaneously, but this is just a matter of concatenating vectors and will be omitted in this post.

Computation Package

Now that we’ve explained and summarized the approach, you can implement it yourself in your favorite language or computational software. However, I have found a very good Python package that already has Sinkhorn metrics built in: Geomloss, written by machine learning researcher Jean Feydy. The project is open source, available on Github, and compatible with Pytorch tensors. I highly recommend it if you need to compute Wasserstein distances.

Some example code you can use with this package (after installation):

import torch
from geomloss import SamplesLoss

torch.random.manual_seed(0)
X = torch.rand((3, 100))  # 3 sample points in 100 dimensions
Y = torch.rand((3, 100))

# Sinkhorn divergence between the two point clouds; blur controls the entropic regularization scale
Loss = SamplesLoss("sinkhorn", blur=0.05)
print(Loss(X, Y).item())

  1. Thomas Cover: Elements of Information Theory (1991)

  2. Marco Cuturi: Sinkhorn Distances: Lightspeed Computation of Optimal Transport (2013)

  3. Cédric Villani: Optimal Transport: Old and New (2008)

  4. Kristen Anderson, Ashley Burt, Will Cousins, Brent Hancock, David Strong: A Sinkhorn-Knopp Fixed Point Problem (2010)