<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="http://oxhidingacne.com/feed.xml" rel="self" type="application/atom+xml" /><link href="http://oxhidingacne.com/" rel="alternate" type="text/html" /><updated>2022-06-19T17:05:30-04:00</updated><id>http://oxhidingacne.com/feed.xml</id><title type="html">Jack Ding’s Blog</title><subtitle>Machine Learning, Music, Math, and Video Games</subtitle><entry><title type="html">Wasserstein Distance and Sinkhorn Metric (Part II)</title><link href="http://oxhidingacne.com/machine_learning/theory/2022/06/19/wasserstein-distance-and-sinkhorn-metric-part-ii.html" rel="alternate" type="text/html" title="Wasserstein Distance and Sinkhorn Metric (Part II)" /><published>2022-06-19T01:25:14-04:00</published><updated>2022-06-19T01:25:14-04:00</updated><id>http://oxhidingacne.com/machine_learning/theory/2022/06/19/wasserstein-distance-and-sinkhorn-metric-part-ii</id><content type="html" xml:base="http://oxhidingacne.com/machine_learning/theory/2022/06/19/wasserstein-distance-and-sinkhorn-metric-part-ii.html"><![CDATA[<p>Recall that in the <a href="/machine_learning/theory/2022/06/18/wasserstein-distance-and-sinkhorn-metric.html">last post</a>, we gave the discrete version of the Wasserstein-p distance:</p>

\[W_p(\mu, \nu) = \left( \min\limits_{P(x, y) \in M(d, d)^+, \mu = P \cdot 1_d, \nu = P^T 1_d}  \langle D(x, y)^{(p)}, P(x, y)\rangle_F \right)^{1/p}\]

<p>For convenience, we denote the set of all joint distribution matrices with marginals
\(\mu, \nu\) as</p>

<p>\(U(\mu, \nu) = \{P(x, y) \in M(d, d)^+, \mu = P \cdot 1_d, \nu = P^T 1_d\}\).</p>

<p>In fact this space is a convex polytope and we call it the <strong>Transport Polytope</strong>.</p>
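<p>These membership conditions are easy to check numerically. Below is a small sketch (the marginals are made up for illustration) confirming that the independent coupling \(\mu\nu^T\) always lies in the transport polytope:</p>

```python
import numpy as np

# Two discrete marginals on a space of size d = 3 (values chosen for illustration).
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.4, 0.4, 0.2])

# The independent coupling mu nu^T is always a valid member of U(mu, nu).
P = np.outer(mu, nu)

# Membership conditions: nonnegative entries, row sums = mu, column sums = nu.
assert np.all(P >= 0)
assert np.allclose(P.sum(axis=1), mu)  # P . 1_d = mu
assert np.allclose(P.sum(axis=0), nu)  # P^T 1_d = nu
```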

<h2 id="information-theory-revisited">Information Theory Revisited</h2>
<p>You might also recall our discussion of Shannon entropy and KL divergence from the
last post. I might have given the impression that these concepts have been displaced
by the new exciting world of optimal transport, but as it turns out, they are very important
for computing the Sinkhorn metric!</p>

<p>As we can readily gather from the problem setting, there can potentially be many
joint distributions with marginals \(\mu, \nu\), even in the discrete case. However,
we can constrain which joint distributions are admissible by computing
their Shannon entropy.</p>

<p><strong>Theorem</strong><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">1</a></sup>: \(H(P) \le H(\mu) + H(\nu)\), and the inequality is sharp.</p>

<p>In other words, the Shannon entropy of the joint distribution is always bounded by
the sum of the Shannon entropies of the marginal distributions. Moreover, this bound
is sharp since we can always construct the independent joint distribution matrix \(\mu\nu^T\).</p>
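<p>Both the bound and its sharpness can be verified with a quick numerical sketch (the coins below are made up for illustration):</p>

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

mu = np.array([0.5, 0.5])
nu = np.array([0.75, 0.25])

# The independent coupling attains H(P) = H(mu) + H(nu), so the bound is sharp.
P_indep = np.outer(mu, nu)
assert np.isclose(shannon_entropy(P_indep.ravel()),
                  shannon_entropy(mu) + shannon_entropy(nu))

# Any other coupling with the same marginals has strictly smaller entropy.
P_other = np.array([[0.50, 0.00],
                    [0.25, 0.25]])  # row sums = mu, column sums = nu
assert shannon_entropy(P_other.ravel()) < shannon_entropy(mu) + shannon_entropy(nu)
```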

<p>With this constraint in mind, we define a subset of the transport polytope:</p>

\[U_{\alpha}(\mu, \nu) = \{ P \in U(\mu, \nu) | KL(P || \mu\nu^T) \le \alpha\}\]

<p>Intuitively, this subset contains all joint distributions whose Shannon entropy is less than
\(\alpha\) away from the maximum possible entropy \(H(\mu) + H(\nu)\) (which is attained
by the joint distribution \(\mu\nu^T\)); that is, all joint distributions which have
<strong>sufficient entropy</strong> with respect to \(H(\mu), H(\nu)\).</p>

<h2 id="sinkhorn-metric">Sinkhorn Metric</h2>
<p>Now we are ready to define the <strong>Sinkhorn Metric</strong>:</p>

\[d_{D, \alpha}(\mu, \nu) = \min\limits_{P(x, y) \in U_{\alpha}(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F\]

<p>We see that this distance is similar to the Wasserstein-1 distance except we are only looking
for the minimum distance among joint distributions with sufficient entropy with respect to \(H(\mu), H(\nu)\).</p>

<p>Why only consider this restricted subset of joint distributions? The author of <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>
explains that in solutions to the optimal transport problem, the optimal joint distributions
often lie at the corners (vertices) of the transport polytope, and are thus statistical outliers.
The entropic restriction smooths the space of joint distributions,
which allows for easier computation.</p>

<h3 id="special-cases">Special cases</h3>
<p>In one extreme, let us take \(\alpha\) to infinity. As \(\alpha\) grows larger, the entropy
constraint loosens, so \(U_{\alpha}(\mu, \nu)\) grows until it converges
to \(U(\mu, \nu)\) proper. So \(U_{\alpha}(\mu, \nu)\) is a good approximation of
the transport polytope for large enough \(\alpha\).</p>

<p>In the other extreme, if we set \(\alpha = 0\), we can solve the optimal transport problem
in closed form given any distance matrix. In fact we have:</p>

<p><strong>Theorem</strong>: \(d_{D, 0} = \mu^T D \nu\).</p>
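<p>In other words, at \(\alpha = 0\) the only admissible coupling is the independent one, so the optimum is just the Frobenius inner product \(\langle D, \mu\nu^T\rangle_F\). A quick numerical sketch (the marginals and distance matrix are made up for illustration):</p>

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.4, 0.4, 0.2])
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])  # an illustrative distance matrix

# At alpha = 0, the only coupling with KL(P || mu nu^T) <= 0 is P = mu nu^T,
# so the Sinkhorn distance collapses to <D, mu nu^T>_F = mu^T D nu.
d_closed_form = mu @ D @ nu
d_frobenius = np.sum(D * np.outer(mu, nu))
assert np.isclose(d_closed_form, d_frobenius)
```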

<p>Beyond these special cases, the author proves the symmetry and triangle inequality properties.
The proof is similar to C. Villani’s proof in <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup>.</p>

<h2 id="duality">Duality</h2>
<p>In many optimization problems, solving a problem amounts to solving its <strong>dual problem</strong>.
Perhaps the most popular example is with Lagrange multipliers. Given a (constrained) optimization
problem</p>

\[p^* = \min\limits_x f_0(x): f_i(x) \le c_i, i = 1, \cdots, m\]

<p>its dual problem turns the problem into an unconstrained one, i.e.</p>

\[d^* = \max\limits_{\lambda \ge 0 } \min\limits_x \mathcal{L}(x, \lambda)\]

<p>where \(\mathcal{L}(x, \lambda) = f_0(x) + \displaystyle\sum\limits_{i=1}^m \lambda_i (f_i(x) - c_i)\) is
called the <strong>Lagrangian</strong> and the parameters \(\lambda_i\) are called <strong>Lagrange multipliers</strong>.</p>

<p>It may seem at first glance that the optimal transport problem \(\min\limits_{P(x, y) \in U(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F\) lacks constraints, but by considering entropy, we do have a hard
constraint, namely that \(H(P) \le H(\mu) + H(\nu)\)! Therefore we can transform the optimal transport
problem into the dual problem</p>

\[\min\limits_{P(x, y) \in U(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F - \frac{1}{\lambda} H(P)\]

<p><em>(The author uses a different convention of putting the Lagrange multiplier in the denominator; this
    does not change the optimization problem.)</em></p>

<p>Notice that by introducing the sufficient entropy bound \(\alpha\), we get a dual problem for each
value of \(\alpha\), and indeed a corresponding Lagrange multiplier for each. Therefore we can also
dualize the Sinkhorn distance \(\min\limits_{P(x, y) \in U_{\alpha}(\mu, \nu)} \langle D(x, y), P(x, y)\rangle_F\).</p>

<p>It turns out that computing the dual problem is computationally simpler.</p>

<h2 id="algorithm">Algorithm</h2>

<p>The algorithm starts with computing the matrix \(K = \exp(-\lambda D)\), which is
the <strong>elementwise</strong> exponential of the distance matrix \(D\). Then it uses the
<a href="http://seaver-faculty.pepperdine.edu/dstrong/Research/Files/PME2010.pdf">Sinkhorn fixed point iteration algorithm</a><sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> to compute the scaling vectors that define the dual joint distribution matrix \(P_{\lambda}\). Finally, once a
convergence threshold is reached, the final iterate is used to compute the Sinkhorn distance
by taking a Frobenius inner product with \(D\).</p>

<p>Below is pseudocode of the algorithm (all dots before operators indicate elementwise operations):</p>

<figure class="highlight"><pre><code class="language-text" data-lang="text">function dual_sinkhorn(D, lambda, mu, nu) {
    K = .exp(-lambda * D)                    # elementwise exponentiation of the matrix
    K_tilde = diag(1. / mu) * K              # rescale the rows of K by the source marginal
    u = ones(size(mu)) / length(mu)
    while u changes more than some threshold:
        u = 1. / (K_tilde * (nu ./ (K.transpose() * u)))   # Sinkhorn fixed-point iteration
    v = nu ./ (K.transpose() * u)
    d = sum(u .* ((K .* D) * v))             # Frobenius inner product of K .* D with the scalings
    return d
}</code></pre></figure>
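<p>For concreteness, here is a rough NumPy translation of the algorithm (a sketch following Cuturi’s Algorithm 1, not a reference implementation; the example marginals, the stopping rule, and the variable names are my own choices):</p>

```python
import numpy as np

def dual_sinkhorn(D, lam, mu, nu, tol=1e-9, max_iter=10000):
    """Sketch of the dual-Sinkhorn distance between histograms mu and nu.

    D is the pairwise distance matrix and lam the regularization strength
    (larger lam means less entropic smoothing).
    """
    K = np.exp(-lam * D)                  # elementwise exponential
    K_tilde = K / mu[:, None]             # diag(1 / mu) @ K
    u = np.full(mu.shape, 1.0 / mu.size)
    for _ in range(max_iter):
        u_new = 1.0 / (K_tilde @ (nu / (K.T @ u)))  # Sinkhorn fixed-point iteration
        converged = np.max(np.abs(u_new - u)) < tol
        u = u_new
        if converged:
            break
    v = nu / (K.T @ u)
    return np.sum(u * ((K * D) @ v))      # Frobenius product <D, P_lambda>

mu = np.array([0.5, 0.5])
nu = np.array([0.25, 0.75])
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])
# Approaches the unregularized optimal transport cost 0.25 as lam grows.
print(dual_sinkhorn(D, lam=5.0, mu=mu, nu=nu))
```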

<p>The author in fact extends the algorithm to calculate Sinkhorn distances for \(N\) target distributions simultaneously. But this is just a matter of concatenating vectors, and will be omitted
in this post.</p>

<h2 id="computation-package">Computation Package</h2>

<p>Now that we’ve explained and summarized the approach, you can implement it yourself
in your favorite language or computational software. However, I have found a very good Python
package that already has Sinkhorn metrics built in: <a href="https://www.kernel-operations.io/geomloss/">Geomloss</a>, written by machine learning researcher <a href="https://www.jeanfeydy.com/">Jean Feydy</a>.
This project is open source and available on GitHub. It is also compatible with PyTorch
tensors. I highly recommend it if you need to compute
Wasserstein distances.</p>

<p>Some example code you can use with this package (after installation):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import torch
from geomloss import SamplesLoss

torch.random.manual_seed(0)
X = torch.rand((3, 100))
Y = torch.rand((3, 100))

Loss = SamplesLoss("sinkhorn", blur=0.05)
print(Loss(X, Y).item())</code></pre></figure>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:3" role="doc-endnote">
      <p>Cover, Thomas: <strong>Elements of Information Theory</strong> (1991) <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>Marco Cuturi: <strong>Sinkhorn Distances: Lightspeed Computation of Optimal Transport</strong> (2013) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Cedric Villani: <strong>Optimal Transport: Old and New</strong> (2008) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Kristen Anderson, Ashley Burt, Will Cousins, Brent Hancock, David Strong: <strong>A Sinkhorn-Knopp Fixed Point Problem</strong> (2010) <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="machine_learning" /><category term="theory" /><summary type="html"><![CDATA[Recall that in the last post, we gave the discrete version of the Wasserstein-p distance:]]></summary></entry><entry><title type="html">Wasserstein Distance and Sinkhorn Metric (Part I)</title><link href="http://oxhidingacne.com/machine_learning/theory/2022/06/18/wasserstein-distance-and-sinkhorn-metric.html" rel="alternate" type="text/html" title="Wasserstein Distance and Sinkhorn Metric (Part I)" /><published>2022-06-18T12:25:14-04:00</published><updated>2022-06-18T12:25:14-04:00</updated><id>http://oxhidingacne.com/machine_learning/theory/2022/06/18/wasserstein-distance-and-sinkhorn-metric</id><content type="html" xml:base="http://oxhidingacne.com/machine_learning/theory/2022/06/18/wasserstein-distance-and-sinkhorn-metric.html"><![CDATA[<p>Optimal transport is becoming a hotly investigated theory in machine learning.
What was previously a theory in mathematics with certain applications in economics
is now finding many use cases in machine learning, such as image translation.
A key concept in optimal transport theory is the Wasserstein distance,
a metric for quantifying how far apart two datasets (or distributions) are.
In this post we will introduce the Wasserstein distance and how it differs from more
traditional measures. In the following post, we will introduce an efficient means of
computing Wasserstein distances called the Sinkhorn metric.</p>

<h2 id="comparing-datasets">Comparing Datasets</h2>
<p>In many applications of machine learning, one might find it useful to compare two
different datasets. For example, one might want to compare the distributions of
training data to testing data. Or one might want to compare different training datasets
for the purposes of transfer learning. Whatever the case, it is important to have
a metric to determine how “different” (in terms of distribution) the datasets are from one another,
assuming we don’t know their distributions beforehand.</p>

<h2 id="an-information-theoretic-perspective">An Information-theoretic Perspective</h2>
<p>Traditionally the answer to the problem lay in information theory. Given two random
variables \(X\) and \(Y\), we may treat \(X\) as data that’s already observed (i.e. training data)
and \(Y\) as a model that aims to approximate \(X\) (i.e. testing data).</p>

<h3 id="shannon-entropy">Shannon Entropy</h3>
<p>From an information-theoretic perspective, a (discrete) random variable \(X\) can be thought of
as containing information, depending on what its probability distribution is like. This
information can be encoded in bits of 0s and 1s.</p>

<p>Now we want to ask, is there a quantifiable difference between a fair coin (50/50 heads or tails)
and a rigged coin? It turns out that the amount of information needed to encode the two types of
coins is very different.</p>

<p>Suppose for the moment that you know the coin is fair, and someone flips the coin 2 times
without your knowledge. You want to know how the flips landed.
Out of these 2 flips we have a total of \(2^2\) possible outcomes of ordered sequences of heads or tails, and
you want to know exactly which outcome occurred on the tosses. You must ask
the tosser a series of yes/no questions in order to find out which coins were heads and which coins
were tails. How many questions would you need to ask on average, in order to figure out
the configuration of the coin tosses?</p>

<p>The first thing that may come into your mind is to ask sequentially:</p>
<ul>
  <li>Is the first coin heads?</li>
  <li>Is the second coin heads?</li>
</ul>

<p>Then we would need to ask 2 questions for any particular experiment,
in order to know exactly how the 2 coin flips landed. In terms of information, we can encode
this as two bits, the leftmost bit encodes the answer to the first question and the
rightmost bit encodes the answer to the second question. E.g. the string 10 encodes the answers
yes to the first question, and no to the second question. Since this uniquely determines the
configuration, the string 10 corresponds to heads-tails.</p>

<p>Since the coin is fair, every one of the \(2^2\) configurations is
equally likely. Since we needed to ask 2 questions for each, the average number of yes/no questions required is 2, and the average number of questions required per coin flip is 1. That’s how much <strong>information</strong> we needed to figure out the outcomes of the coin tosses. So for a fair coin, we may say that it requires 1 <strong>bit</strong> of information per toss to encode its distribution.</p>

<p>Now what if you knew that the coin was unfair? Let’s take a very extreme example, the coin is
magnetized on one side and there’s a hidden magnet underneath the table so that it
always comes up heads, and you know this. How many questions (on average) would we need to ask in this case?</p>

<p>The answer is 0. Because you already know that the coin will come up heads every time.
So no questions required. Therefore a coin which always comes up heads requires 0 information
to encode.</p>

<p>Now what about something in between? Suppose that \(P(Heads) = \frac{3}{4}\) and \(P(Tails) = \frac{1}{4}\).</p>

<p>It may still seem like we need \(n\) questions for \(n\) tosses to figure out the exact
configuration. But it turns out that in this biased case, we can do better.</p>

<p>Consider the following series of questions we ask for a 2-toss experiment:</p>
<ol>
  <li>Is there at least one tails? -&gt; If answer is no, then we must have head-head.</li>
  <li>If 1. is yes, is the first coin tails? -&gt; If answer is no, then we have head-tail (since we know there is at least one tail and it’s not the first coin).</li>
  <li>If 2. is yes, is the second coin heads? -&gt; If answer is yes, then we have tail-head. If answer is no, then we must have tail-tail.</li>
</ol>

<p>In terms of encoding, we encode each question as a bit once again. So for example, the string
110 represents: yes to question 1 and 2, and no to question 3. So 110 corresponds to the
configuration tail-tail, for example.</p>

<p>So as we can see, here we use a strategy that is not uniform, in the sense that some configurations
will lead us to only ask one question (head-head), whereas some others require 3 questions (tail-tail).
Since we are interested in the <strong>average</strong> number of questions, this allows us to surpass the naive
strategy of asking 2 sequential questions (which is 2 on average as well).</p>

<p>Why is this? Let’s look at the probabilities of each configuration and the number of corresponding
questions:</p>
<ul>
  <li>Head-head: probability = \((\frac{3}{4})^2 = \frac{9}{16}\), questions = 1</li>
  <li>Head-tail: probability = \(\frac{3}{4} \cdot \frac{1}{4} = \frac{3}{16}\), questions = 2</li>
  <li>Tail-head: probability = \(\frac{1}{4} \cdot \frac{3}{4} = \frac{3}{16}\), questions = 3</li>
  <li>Tail-tail: probability = \(\frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}\), questions = 3</li>
</ul>

<p>Then the expected (or average) number of questions is given by</p>

\[\frac{9}{16} \cdot 1 + \frac{3}{16} \cdot 2 + \frac{3}{16} \cdot 3 + \frac{1}{16} \cdot 3 = 1.6875\]

<p>and since this covers 2 tosses, the average number of questions per toss is 0.84375, lower than the 1
bit (or question) required for a fair coin. In terms of information, we needed an average of 0.84375 bits per toss
to encode this probability distribution.</p>
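<p>The arithmetic above can be reproduced in a couple of lines:</p>

```python
# Expected number of yes/no questions for the 3/4-1/4 coin over two tosses,
# using the question tree described above.
probs = [9/16, 3/16, 3/16, 1/16]  # HH, HT, TH, TT
questions = [1, 2, 3, 3]          # questions needed for each configuration

expected = sum(p * q for p, q in zip(probs, questions))
print(expected)      # 1.6875 questions per two tosses
print(expected / 2)  # 0.84375 questions per toss
```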

<p>Now it’s easy to see why this strategy won’t work for a fair coin: since each configuration has equal
probability, we would be taking the average of \(\{1, 2, 3, 3\}\), which gives us a number greater than 2.
So for the fair coin, asking the value of each coin sequentially is actually the best we can do. This gives us a meaningful and quantitative way of distinguishing between coins of different biases: the fair coin
needs more information to “figure out” or encode, and the more biased a coin is, the less information is needed to “figure it out”. So a fair coin contains more information content.</p>

<p>In the case of the biased coin, how can we be sure that we’ve asked the best set of
questions (i.e. found the most efficient encoding)? Formally, Claude Shannon proved an achievable lower bound
on the amount of information needed. Given a discrete random variable \(X\), its <strong>Shannon entropy</strong>
is given by:</p>

\[H(X) = - \displaystyle\sum\limits_x p(x) \log_2 p(x)\]

<p>The base of the logarithm indicates the unit of information we’re dealing with. Usually base 2 is chosen because we quantify information in terms of bits. A quick calculation shows that the Shannon entropy of a fair coin is 1, while the Shannon entropy of the biased coin we looked at earlier is in fact around 0.81, which means that in our example we didn’t use the best possible encoding.</p>
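<p>These entropy values are easy to check with a minimal sketch:</p>

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy_bits([0.75, 0.25]))  # biased coin: ~0.811 bits
```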

<h3 id="kullback-leibler-divergence">Kullback-Leibler Divergence</h3>
<p>The Kullback-Leibler (KL) divergence is a relative measure of how much information is lost
when trying to approximate the known distribution \(P\) with the new model \(Q\) (defined
    on the same sample space). More
precisely, it measures the number of extra bits required to encode samples from \(P\)
using a code optimized for \(Q\). Given densities \(p(x), q(x)\) for \(P, Q\) respectively,
its formula is as follows:</p>

\[KL(P||Q) = \displaystyle\sum\limits_x p(x) \log\left(\frac{p(x)}{q(x)}\right)\]

<p>Intuitively, the log factor measures the difference in information between the two
distributions, and therefore the sum becomes the expected difference in information
between \(P\) and \(Q\).</p>

<p>To take the example of our two coins from before. Let \(P\) denote the random variable
of the fair coin and \(Q\) denote the random variable of the coin with \(P(Heads) = \frac{3}{4}\) and \(P(Tails) = \frac{1}{4}\). The KL divergence from \(Q\) to \(P\) is given by</p>

\[KL(P||Q) = \frac{1}{2} \log_2\left(\frac{1/2}{3/4}\right) + \frac{1}{2} \log_2\left(\frac{1/2}{1/4}\right) \approx 0.208\]

<p>We interpret this as saying: if we have an optimized set of questions which tells us the
configurations of the biased coin, we would need an extra 0.208 questions (on average) per coin
in order to figure out the configurations of the fair coin. This makes sense since
the fair coin holds more information, as we’ve already seen.</p>

<h3 id="caveats">Caveats</h3>
<p>Now this seems like a perfect way to measure differences between distributions.
For example, one might have a training set and wonder how much extra information is
needed to classify the testing set, knowing an optimal function to classify the training set.
However, there’s one huge caveat: KL divergence is not a distance function!</p>

<p>To see this, note that for any reasonable definition of distance, going from
a point A to a point B should yield the same distance as going back from B to A.
However, KL divergence doesn’t have this property! Let’s take the two coins from the
previous section; now we want to compute the KL divergence from \(P\) to \(Q\):</p>

\[KL(Q||P) = \frac{3}{4} \log_2\left(\frac{3/4}{1/2}\right) + \frac{1}{4} \log_2\left(\frac{1/4}{1/2}\right) \approx 0.189 \neq KL(P||Q)\]

<p>So given our sequential strategy of asking one coin at a time for the fair coin, if we applied
the same strategy to the biased coin, it would require on average additional questions!
This happens even though the biased coin contains less information, because of course,
the optimal encoding for one distribution might not be the optimal encoding for the other, and
vice-versa.</p>
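<p>Both divergences can be checked in a few lines (computed in bits, i.e. with base-2 logarithms):</p>

```python
import math

def kl_bits(p, q):
    """KL divergence in bits between discrete distributions p and q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]    # fair coin
Q = [0.75, 0.25]  # biased coin

print(kl_bits(P, Q))  # ~0.208
print(kl_bits(Q, P))  # ~0.189 -- different, so KL is not symmetric
```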

<p>Another property that the KL divergence <strong>does not</strong> satisfy is the <strong>triangle inequality</strong>, formally
it takes the form</p>

\[d(x, z) \le d(x, y) + d(y, z)\]

<p>for points \(x, y, z\) in the same space and a distance function \(d\). Intuitively, it says that
going directly from a point A to another point B is never longer than passing through an
intermediate point in between.</p>

<p>To show an example of this failure, consider again the fair and biased coins \(P, Q\) from before, with
an additional biased coin \(R\) with \(P(Heads) = \frac{9}{10}\) and \(P(Tails) = \frac{1}{10}\).
We leave you to show that \(KL(P||R) &gt; KL(P||Q) + KL(Q||R)\).</p>
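<p>One violated instance is easy to check numerically (base-2 logarithms; going directly from the fair coin to the heavily biased coin turns out to “cost” more than detouring through the mildly biased one):</p>

```python
import math

def kl_bits(p, q):
    """KL divergence in bits between discrete distributions p and q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]    # fair coin
Q = [0.75, 0.25]  # mildly biased coin
R = [0.9, 0.1]    # heavily biased coin

# The direct divergence exceeds the sum through the intermediate point,
# so the triangle inequality fails for KL divergence.
assert kl_bits(P, R) > kl_bits(P, Q) + kl_bits(Q, R)
```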

<p>This somehow implies that switching directly to a new dataset requires more information
(e.g. extra training) than going to some third dataset, and then switching to the new dataset, which
seems paradoxical.</p>

<h2 id="costs-of-moving">Costs of Moving</h2>
<p>Sometimes people will use the <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon divergence</a> as a symmetrized version of the KL divergence. Unfortunately
this only takes care of the symmetry property but not the triangle inequality property (in general).</p>

<p>More recently, machine learning researchers have started to think about this problem of comparing
distributions using a different perspective: rather than think about some notion of intrinsic
difference between two distributions, instead think about how much it would cost (this could be
computing power, physical energy, monetary cost, etc.) to move the data from one distribution to
the other.</p>

<p>In a way this is even more useful than KL divergence. Even though KL divergence tells us
how much information it takes to move from one distribution to another, it’s agnostic
on how much each bit of information <strong>costs</strong>. In real-world applications, we care
very much about costs, and some forms of information may be cheaper than others.
As we will see shortly, this interpretation involving costs of moving also gives us
a bona fide distance function to work with.</p>

<h2 id="wasserstein-distance">Wasserstein Distance</h2>
<p>The original motivation behind Wasserstein distance, and indeed the field of
optimal transport, comes from transporting goods and resources over distances.
For example, a copper mine is located some distance away from the factories, and
manufacturers often needed to calculate the most cost-efficient way to transport
raw copper to the factories for processing. This is often a hard problem because
there are many paths to choose from (different roads have different costs) and also
because there are multiple mines and factories (which mine should transport to which
factory).</p>

<p>By this analogy, given distributions \(P\) and \(Q\), we can think of samples in \(P\)
as locations of the mines and samples in \(Q\) as the locations of the factories. Then what
we want to do is to calculate the most cost-efficient way to move the copper (samples of \(P\))
to the factories (samples of \(Q\)). For now let’s declare the cost of moving from point \(x\) to point \(y\) to be the Euclidean distance between them: \(||x - y||\).</p>

<p>Now let us set up the definition of Wasserstein distance. Suppose we have two random variables
\(X\) and \(Y\) jointly distributed on a probability space \(\mathcal{X}\), such that
the marginal densities of \(X\) and \(Y\) are given by \(\mu, \nu\) respectively.
Moreover suppose we have a distance function \(d(x, y)\).</p>

<p>Then we define the <strong>Wasserstein-p distance</strong> as:</p>

\[W_p(\mu, \nu) = \left( \inf\limits_{\pi \in \Pi(\mu, \nu)} \displaystyle\int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\pi(x, y) \right)^{1/p}\]

<p>where the infimum is taken over all possible joint distributions \(\pi\) of \(X, Y\) with marginals \(\mu\) and \(\nu\).</p>

<p>Above is the general definition for continuous random variables, but since we are mostly concerned
with doing computations and optimizations, let us focus on the discrete case. In the discrete
setting, the probability space is finite with size \(d = |\mathcal{X}|\). The distance function \(d(x, y)\) becomes a distance matrix \(D(x, y)\) of size \(d \times d\), and the joint distribution also becomes a matrix \(P(x, y)\) of the same size. Here
the marginal distributions become the vectors \(\mu = P \cdot 1_d, \nu = P^T 1_d\), the row and column sums of the joint distribution matrix.</p>

<p>So the <strong>discrete Wasserstein-p distance</strong> is given by:</p>

\[W_p(\mu, \nu) = \left( \min\limits_{P(x, y) \in M(d, d)^+, \mu = P \cdot 1_d, \nu = P^T 1_d} \langle D(x, y)^{(p)}, P(x, y)\rangle_F \right)^{1/p}\]

<p>where \(\langle \cdot, \cdot \rangle_F\) denotes the <a href="https://en.wikipedia.org/wiki/Frobenius_inner_product">Frobenius inner product</a> for matrices, and \(D(x, y)^{(p)}\) denotes the <strong>elementwise</strong> \(p\)-th power of \(D(x, y)\).</p>
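<p>Since the discrete problem is a linear program, we can compute it directly, for example with SciPy’s <code>linprog</code> (a sketch for a tiny two-point space; this brute-force formulation scales poorly with \(d\)):</p>

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(mu, nu, D, p=1):
    """Discrete Wasserstein-p distance by solving the linear program directly."""
    d = len(mu)
    cost = (D ** p).ravel()  # elementwise p-th power of D, flattened row-major
    # Equality constraints: row sums of P equal mu, column sums equal nu.
    A_eq = np.zeros((2 * d, d * d))
    for i in range(d):
        A_eq[i, i * d:(i + 1) * d] = 1.0  # row-sum constraint for row i
        A_eq[d + i, i::d] = 1.0           # column-sum constraint for column i
    b_eq = np.concatenate([mu, nu])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1 / p)

mu = np.array([0.5, 0.5])
nu = np.array([0.25, 0.75])
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])
# The optimal plan moves a quarter of the mass one unit of distance: W_1 = 0.25.
print(wasserstein_p(mu, nu, D))
```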

<p>It’s easy to see that under both definitions, the Wasserstein-p distance is symmetrical. The proof of
triangle inequality is more involved, see <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="conclusion">Conclusion</h2>
<p>In this post we have summarized the long-standing topic of calculating differences between
distributions. We have presented an information-theoretic perspective and defined KL divergence
and Wasserstein(-p) distance. In the next post we will cover a practical way of computing
these quantities.</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Cedric Villani: <strong>Optimal Transport: Old and New</strong> (2008) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="machine_learning" /><category term="theory" /><summary type="html"><![CDATA[Optimal transport is becoming a hotly investigated theory in machine learning. What was previously a theory in mathematics with certain applications in economics is now finding many use cases in machine learning, such as image translation. A key concept in optimal transport theory is the Wasserstein distance, a metric for quantifying how far apart two datasets (or distributions) are. In this post we will introduce the Wasserstein distance and how it differs from more traditional measures. In the following post, we will introduce an efficient means of computing Wasserstein distances called the Sinkhorn metric.]]></summary></entry><entry><title type="html">Uncanny Musical Similarities - I</title><link href="http://oxhidingacne.com/music/2022/04/01/uncanny-musical-similarities-1.html" rel="alternate" type="text/html" title="Uncanny Musical Similarities - I" /><published>2022-04-01T12:00:14-04:00</published><updated>2022-04-01T12:00:14-04:00</updated><id>http://oxhidingacne.com/music/2022/04/01/uncanny-musical-similarities-1</id><content type="html" xml:base="http://oxhidingacne.com/music/2022/04/01/uncanny-musical-similarities-1.html"><![CDATA[<p>In the spirit of April Fool’s day I have started a series on this blog called <em>Uncanny Musical Similarities</em>. In this post and subsequent follow-ups, I will highlight some very funny and mysterious musical coincidences in which different musicians have somehow converged on the same idea.</p>

<p>The format will be as follows: I will post two Youtube videos of the music side by side and indicate the time snippets during which the similarities happen. For clarity, I will also include a picture of the motif or idea common to both compositions alongside the videos.</p>

<p><strong>Important:</strong></p>
<ul>
  <li>I am in <strong>no way implying any sort of plagiarism with these comparisons!</strong> That would defeat the purpose of this series. So I will not be comparing <strong>Sergei Rachmaninoff’s “Piano Concerto No. 2”</strong> to <strong>Eric Carmen’s “All by Myself”</strong>. The point is to bring attention to the absurd coincidences we sometimes encounter in music, and in life!</li>
  <li>Samples, tributes, and homages don’t count. It has to be a genuine coincidence! So I will not be comparing <strong>King Crimson’s “21st Century Schizoid Man”</strong> with <strong>Kanye West’s “Power”</strong>, nor will I be comparing <strong>Martika’s “Toy Soldiers”</strong> with <strong>Eminem’s “Like Toy Soldiers”</strong>.</li>
  <li>Standards and folk tunes don’t count, since pieces that use them are all referring to a common starting point, not a coincidence.</li>
</ul>

<p>To kick off the series, I present a cross-genre coincidence:</p>

<p><img src="/assets/uncanny1.png" alt="Marsalis &amp; Prokofiev" /><em>Common Motif</em></p>

<p><strong>Branford Marsalis Quartet - Dance of the Evil Toys</strong> (motif at 1:41, repeats throughout)</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/gN0Hyy-FUBE?start=101" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<p>and…</p>

<p><strong>Sergei Prokofiev - Piano Concerto No. 3</strong> (motif at 15:50, repeats a few times in movement 2)</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/BS0SwRoYAW0?start=950" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<hr />

<p>As a bonus, here’s an intra-genre coincidence:</p>

<p><img src="/assets/uncanny2.png" alt="Chopin and Brahms" /><em>A basic reduction of the common passage, different keys and meters are used in the pair of examples below</em></p>

<p><strong>Frederic Chopin - Etude Op. 10 No. 7</strong> (passage at 0:23, only appears once)</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/RZWjOzBl0zU?start=23" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<p>and…</p>

<p><strong>Johannes Brahms - Hungarian Dance No. 5</strong> (passage at 0:48, appears multiple times)</p>

<iframe width="560" height="315" src="https://www.youtube.com/embed/Nzo3atXtm54?start=48" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<hr />

<p>I will continue the series as I find more examples (which can be very hard!). So updates at regular intervals are probably out of the question.</p>]]></content><author><name></name></author><category term="music" /><summary type="html"><![CDATA[In the spirit of April Fool’s day I have started a series on this blog called Uncanny Musical Similarities. In this post and subsequent follow-ups, I will highlight some very funny and mysterious musical coincidences in which different musicians have somehow converged on the same idea.]]></summary></entry><entry><title type="html">Mixed Precision Training for Machine Learning (Part II)</title><link href="http://oxhidingacne.com/machine_learning/hardware/2022/03/28/mixed-precision-training-2.html" rel="alternate" type="text/html" title="Mixed Precision Training for Machine Learning (Part II)" /><published>2022-03-28T12:00:14-04:00</published><updated>2022-03-28T12:00:14-04:00</updated><id>http://oxhidingacne.com/machine_learning/hardware/2022/03/28/mixed-precision-training-2</id><content type="html" xml:base="http://oxhidingacne.com/machine_learning/hardware/2022/03/28/mixed-precision-training-2.html"><![CDATA[<p>In the last <a href="/machine_learning/hardware/2022/03/16/mixed-precision-training.html">post</a>, we talked about floating-point precision and the mixed
precision method. We discussed how to cast the weights of a neural network into a
lower precision in order to save computation time, while using some clever
tricks to preserve accuracy.</p>

<p>In this post, we will put these ideas into practice by training neural
networks with mixed precision. The machine learning framework used will be <a href="https://pytorch.org/">Pytorch</a> with the <a href="https://pytorch.org/docs/stable/amp.html">AMP (automatic mixed precision)</a> package.</p>

<p>As the name suggests, the AMP package will automatically carry out the mixed precision algorithm (master copy, loss scaling, arithmetic precision) without the user having to do floating point
operations manually.</p>

<p>Here is the basic code format taken from the <a href="https://pytorch.org/docs/stable/notes/amp_examples.html">Pytorch AMP examples page</a>:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"># Packages required for automatic mixed precision
from torch.cuda.amp import GradScaler, autocast

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # The forward pass and loss function need to be wrapped in autocast().
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # scale() from the GradScaler object is called on the loss
        # instead of the usual loss.backward()
        scaler.scale(loss).backward()

        # Unscaling the gradients is necessary if there are gradient
        # operations (such as clipping) to be done:
        # scaler.unscale_(optimizer)
        # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

        # scaler.step() first unscales the gradients of the optimizer's assigned
        # params. If these gradients do not contain infs or NaNs, optimizer.step()
        # is then called; otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale factor for the next iteration.
        scaler.update()</code></pre></figure>

<p>Notice that the main differences are the precision casting in the forward pass and loss computation, and the loss scaling.</p>

<h1 id="experiments">Experiments</h1>

<p>We test out mixed precision training on an NLP classification task, with a pretrained <a href="https://arxiv.org/pdf/1810.04805.pdf">BERT (Bidirectional Encoder Representations from Transformers)</a> (cased) model fine-tuned on <a href="https://huggingface.co/datasets/imdb">IMDB Review Data</a> from <strong>HuggingFace Datasets</strong>. The dataset consists of reviews of films or TV series by IMDB users, and each review is labelled with a positive (1) or negative (0) sentiment. Our aim is to train the BERT model to classify reviews as positive or negative.</p>

<p>The Colab notebook can be found <a href="https://colab.research.google.com/drive/1LViOAk0QP1DevAhSSTKEJPOFCPUPHHbE#scrollTo=OnyfvHze1xBH">here</a>. Much of the code was taken from the excellent <a href="https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/">tutorial</a> on <a href="https://curiousily.com/">Curiousily</a>, as well as the <a href="https://huggingface.co/docs/transformers/tasks/sequence_classification">HuggingFace</a> documentation pages.</p>

<p><strong><em>Warning:</em></strong> <em>Mixed precision only works with GPUs! So if you’re not using GPUs, go to Runtime &gt; Change Runtime Type &gt; Hardware Accelerator, and select GPU from the dropdown menu.</em></p>

<p>The reason for picking BERT lies both in its effectiveness and in its vast number of parameters (~110 million). We would therefore expect the techniques in mixed precision training to save a significant amount of training time.</p>

<p>We took a small sample of 1000 reviews from the dataset and split them into 800 training samples and 200 testing samples. We then trained for 10 epochs with a batch size of 64, using the AdamW optimizer, under both regular training and mixed precision training.</p>
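<p>For reference, wall-clock time for each run can be measured with a simple wrapper like the following (a minimal sketch; <code>train_regular</code> and <code>train_amp</code> are hypothetical names for the two training loops in the notebook):</p>

```python
import time

def timed(train_fn, *args, **kwargs):
    """Run a training function once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    return time.perf_counter() - start, result

# Hypothetical usage with the two training loops from the notebook:
# t_fp32, _ = timed(train_regular, model, train_loader)
# t_amp,  _ = timed(train_amp, model, train_loader)
```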

<p>The results are summarized as follows:</p>

<table>
  <thead>
    <tr>
      <th>Training Method</th>
      <th>Time Taken (s) (*)</th>
      <th>Test Accuracy (after 10 epochs)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Regular</td>
      <td>807.8</td>
      <td>0.88</td>
    </tr>
    <tr>
      <td>Mixed Precision</td>
      <td><strong>412.2</strong></td>
      <td>0.88</td>
    </tr>
  </tbody>
</table>

<p>Notice that with mixed precision, we cut the training time almost in <strong>half</strong>! Moreover, the test accuracy
was comparable between the two models.</p>

<p>Therefore it can be concluded that for large models such as BERT, mixed precision training is all but <strong>essential</strong>.</p>

<h2 id="smaller-models">Smaller Models</h2>

<p>One small caveat we should mention is that mixed precision works best with bigger models. In another experiment, we tried mixed precision on the <a href="https://arxiv.org/pdf/1512.03385.pdf">Resnet18</a> model, which has around 11 million parameters. The task was to train Resnet18 on the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a> under the two training methods. Mixed precision did not offer any advantage in this case.</p>

<hr />

<p><strong>(*)</strong> Technically speaking, the running time also included time spent on validation after each epoch, but
validation of 200 samples takes so little time it can safely be ignored in the analysis.</p>]]></content><author><name></name></author><category term="machine_learning" /><category term="hardware" /><summary type="html"><![CDATA[In the last post, we talked about floating-point precision and the mixed precision method. We discussed how to cast the weights of a neural network into a lower precision in order to save computation time, while using some clever tricks to preserve accuracy.]]></summary></entry><entry><title type="html">Mixed Precision Training for Machine Learning (Part I)</title><link href="http://oxhidingacne.com/machine_learning/hardware/2022/03/16/mixed-precision-training.html" rel="alternate" type="text/html" title="Mixed Precision Training for Machine Learning (Part I)" /><published>2022-03-16T12:25:14-04:00</published><updated>2022-03-16T12:25:14-04:00</updated><id>http://oxhidingacne.com/machine_learning/hardware/2022/03/16/mixed-precision-training</id><content type="html" xml:base="http://oxhidingacne.com/machine_learning/hardware/2022/03/16/mixed-precision-training.html"><![CDATA[<p>When training your deep learning models, one major concern is training time. Today we summarize a method which leverages
your hardware when training models with a large number of parameters, in order to achieve a speed boost. This approach is called <strong>mixed precision training</strong> and is based on the seminal joint paper <a href="https://arxiv.org/pdf/1710.03740.pdf"><strong>Mixed Precision Training</strong></a><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> by Baidu and NVIDIA researchers.</p>

<h3 id="precision-formats">Precision Formats</h3>

<p>In machine learning, and computing in general, more often than not we must do computations outside the realm of integers. However, since most real numbers are irrational and our computers have finite memory, we can never represent an arbitrary real number exactly on a computer.</p>

<p>Therefore computer engineers have developed hardware around the convention that we store a real number <strong>up to</strong> a certain number of significant digits (in binary), with each significant digit stored as a bit in memory. We will soon touch on terms such as <strong>float32</strong> and <strong>float16</strong>.</p>

<h2 id="representing-a-number-in-binary">Representing a number in binary</h2>

<p>Our familiar number system uses base 10, i.e. we represent every number as a sum of powers of 10:</p>

\[12.45 = 1 * 10^1 + 2 * 10^0 + 4 * 10^{-1} + 5 * 10^{-2}\]

<p>In binary the same number is (approximately) represented as</p>

\[12.45 \approx 1100.01110011001\ldots = 2^3 + 2^2 + 2^{-2} + 2^{-3} + 2^{-4} + 2^{-7} + 2^{-8} + 2^{-11} + \cdots\]

<p>Since we now use powers of 2 instead of powers of 10, the only possible digits are 0’s and 1’s. Each binary digit is also called a <strong>bit</strong>. Already we notice two things: 0.45 has no finite binary expansion, and representing the same number in binary generally takes more digits than in decimal.</p>
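<p>The expansion above can be reproduced with a short greedy conversion routine (a throwaway sketch; <code>to_binary</code> is a hypothetical helper, not a library function):</p>

```python
def to_binary(x: float, frac_bits: int = 11) -> str:
    """Greedy binary expansion of a positive number, truncated to frac_bits."""
    int_part = int(x)
    bits = bin(int_part)[2:]          # integer part, e.g. 12 -> '1100'
    frac = x - int_part
    frac_digits = []
    for _ in range(frac_bits):        # fractional part: double and take the carry
        frac *= 2
        frac_digits.append('1' if frac >= 1 else '0')
        if frac >= 1:
            frac -= 1
    return bits + '.' + ''.join(frac_digits)

print(to_binary(12.45))  # '1100.01110011001' (truncated; 0.45 is not finite in binary)
```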

<h2 id="single-precision">Single precision</h2>

<p>One convention we use for storing (an approximation) of real numbers on a computer is <strong>single precision</strong> or <strong>float32</strong>. This means that we will use 32 bits of memory to represent a number. However, this doesn’t just mean we shove 32 significant digits in there. Each bit serves a very specific purpose.</p>

<p><img src="/assets/float32.png" alt="Image from https://en.wikipedia.org/wiki/Single-precision_floating-point_format" /><em>Image from Wikipedia <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></em></p>

<p>The first bit \(s\) is the <strong>sign</strong>; it stores whether the number is negative or positive. The next 8 bits store the <strong>exponent</strong> \(\epsilon\) in binary. The remaining 23 bits \(b_1, \cdots, b_{23}\) store the <strong>significand</strong>, interpreted as the fraction \(b = \sum_{i=1}^{23} b_i 2^{-i}\). The number associated with these 32 bits is given by</p>

<p>\((-1)^s \cdot 2^{\epsilon - 127} (1 + b)\).</p>

<p>Let’s calculate the range of such a configuration.
The stored exponent ranges from 0 to 255, so with the offset of -127 the exponent \(\epsilon - 127\) can range from \(-127\) to \(128\). However, the maximum and minimum stored exponents are reserved for <strong>special numbers</strong>, so the actual exponents available are \(-126\) through \(127\). For the significand, the maximum value is \(1 - 2^{-23}\). With a positive sign, the maximum possible value is</p>

\[2^{127} (2 - 2^{-23}) \approx 3.4028235 * 10^{38}\]

<p>with the minimum possible value the negative of that number.</p>

<p>As for precision near zero, the smallest positive number is \(2^{-149} \approx 1.4 * 10^{-45}\) (a subnormal), and the biggest negative number is the negative of that.</p>

<p>This range is good enough for almost all computational purposes.</p>

<p>As for the reserved special numbers, those have to do with overflow, underflow, and indeterminates. A very nice summary of floating point is given in these <a href="https://courses.cs.washington.edu/courses/cse401/01au/details/fp.html">course notes</a>.</p>
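<p>The decomposition above is easy to check with Python’s <code>struct</code> module (a quick sketch; <code>fields</code> is a hypothetical helper name):</p>

```python
import struct

def fields(x: float):
    """Split a float32 into (sign bit, biased exponent, significand fraction)."""
    (n,) = struct.unpack('>I', struct.pack('>f', x))  # raw 32 bits as an int
    s = n >> 31                  # 1 sign bit
    e = (n >> 23) & 0xFF         # 8 exponent bits (biased by 127)
    b = (n & 0x7FFFFF) / 2**23   # 23 significand bits as a fraction in [0, 1)
    return s, e, b

s, e, b = fields(12.45)
value = (-1) ** s * 2 ** (e - 127) * (1 + b)
print(s, e - 127)   # 0 3 -- positive, exponent 3, matching 1100.0111...
print(value)        # 12.4499998... -- the nearest representable float32
```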

<h2 id="half-precision">Half Precision</h2>
<p>Half precision or float16 uses 16 bits: 1 for the sign, 5 for the exponent, and 10 for the significand.</p>

<p>This format has a smaller range and coarser precision than both single and double precision.
Positive numbers range from \(2^{-24} \approx 5.96 * 10^{-8}\) (the smallest subnormal) up to \((2 - 2^{-10}) * 2^{15} = 65504\).</p>
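<p>Python’s <code>struct</code> module supports this 16-bit format directly via the format code <code>e</code>, so we can watch these limits in action (<code>to_fp16</code> is a hypothetical round-trip helper):</p>

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_fp16(65504.0))  # 65504.0 -- the largest finite float16 survives
print(to_fp16(1e-8))     # 0.0 -- below ~5.96e-8, flushed to zero
print(to_fp16(0.1))      # 0.0999755859375 -- only about 3 decimal digits kept
```

<p>Values beyond the float16 range, such as 70000, make <code>struct.pack</code> raise an <code>OverflowError</code> rather than silently produce a result.</p>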

<h2 id="double-precision">Double Precision</h2>
<p>There is another format called Double precision, or float64. Here we use 64 bits: 1 for the sign, 11 for the exponent, and 52 for the significand.</p>

<h3 id="tradeoff">Tradeoff</h3>
<p>Here we can see a tradeoff between the amount of memory used and the precision of the computation. The more precise we make our calculations, the more memory we need, and vice-versa. Moreover, using fewer digits also speeds up computation. For machine learning, we would obviously like to achieve the dream of using <strong>low memory</strong> to achieve <strong>high precision</strong>.</p>

<p>At first this seems like a logistical problem: How can we allocate more memory? How can we reduce the number of parameters in a deep neural network? But it turns out that we can exploit a bag of floating-point arithmetic tricks to achieve our dream scenario.</p>

<h1 id="the-idea-and-the-tricks">The idea (and the tricks)</h1>

<p>To fully explain the idea, let us first think about what kind of arithmetic operations are done when we train a neural network. No matter what the type of network it is, be it fully-connected, convolutional, or recurrent, the main operations are one of the following: matrix multiplication, reduce operations (e.g. summing up all entries of a vector, mostly comes up when computing means), or activation functions.</p>

<p>In the old days we carried out all these operations in float32. Now that we know float16 saves memory and time, we would ideally like to carry out all these operations in float16.</p>

<p>However, a potential pitfall is the significant loss in precision. Consider the following float32 number:
\(2^{-16} = 0\ 01101111\ 00000000000000000000000 \approx 0.00001525 = 1.525 * 10^{-5}\).
I have written the 32-bit representation on the right. Since the exponent is -16, this number falls below float16’s normal exponent range (which bottoms out at -14), and any quantity below \(2^{-24}\) gets rounded all the way down to 0 in float16. And of course as we know, quantities on the scale of \(10^{-5}\) or smaller are very common in machine learning (small learning rates, small gradients, etc.). In fact, experiments have shown that over half of the activation gradients involved in training certain models are too small to be represented by float16 numbers, and therefore get rounded down to 0.</p>

<p>Coupled with the fact that each pass through the neural network involves potentially millions of arithmetic operations, a seemingly small error such as this can compound into bigger errors, ultimately affecting model accuracy. So the key idea is to identify when exactly we can get away with using float16 instead of float32. Let’s use base-10 to give some examples.</p>

<ul>
  <li>Addition: Does not affect precision by itself. If we add two base-10 numbers with 2 decimal places, for example \(1.43 + 55.09 = 56.52\), the result still has at most 2 decimal places. Likewise, adding two float16’s stays in float16.</li>
  <li>Multiplication: Of course when we multiply two float16’s, every language that supports float16 gives us the product back in float16. So it may seem like multiplication doesn’t change precision either, but there is secretly a casting operation done under the hood. Again let us consider a base-10 example: \(0.2 * 0.3 = 0.06\). The multiplicands each had 1 decimal place, but the product now has 2. If we were hypothetically working in the world of “float1”s, allowing only 1 decimal place, we would have to get rid of the trailing 6 somehow. Depending on the implementation this could be rounding up or rounding down. Either way, some precision is lost here.</li>
</ul>
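<p>Python’s <code>decimal</code> module lets us cap the number of significant digits, so these base-10 examples can be played with directly (a toy sketch, not part of the mixed precision method itself):</p>

```python
from decimal import Decimal, getcontext

# Addition: the result of adding two 2-decimal-place numbers still fits.
getcontext().prec = 4                       # 4 significant digits
print(Decimal('1.43') + Decimal('55.09'))   # 56.52

# Multiplication: the exact product needs more digits than the inputs,
# so a limited-precision context is forced to round.
getcontext().prec = 2                       # 2 significant digits
print(Decimal('0.25') * Decimal('0.35'))    # 0.088 -- exact answer is 0.0875
```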

<h2 id="trick-1-master-copy">Trick 1: Master Copy</h2>
<p>The mixed precision approach keeps a master copy of the weights in float32, with the float16 conversion being used in the actual model.</p>

<p>What this means is that before the first pass through the model, the weights are initialized in float16 and a float32 master copy is made from these initial weights and stored separately. Then before each forward and backward pass, the weights are cast from the master copy into float16 and used in the model (this has no effect on the first training iteration but has an impact later on). After carrying out all relevant computations in float16 (with the performance improvements outlined in the next 2 tricks), the weight update is done in float32 <strong>on the master copy</strong>.</p>

<p>In other words, we use float16 copies of the weights purely as a funnel for time-saving computations, whereas the actual weights we store for the model are in the float32 master copy. This ensures that we don’t lose precision.</p>

<p>But what about the precision lost when converting from the master copy into float16? Let’s do an example in base-10 to see what happens:</p>

<p>Consider doing gradient descent on the convex function \(f(w) = w^2\) with learning rate \(0.0001\), starting from the initialization \(w = 1\). Initially the weight \(w = 1\) is stored in “float2” and a master copy is made in “float6” <strong>(*)</strong>. After calculating gradients, we get that we should subtract \(2 * 0.0001 = 0.0002\) from the weight for the update (note that the weight updates are in the higher precision). This leaves us with the new weight being \(0.9998\). So far so good.</p>

<p>For the next iteration, the weight \(0.9998\) gets cast down to \(0.99\) before gradient computations; the gradient is then \(2 * 0.99 = 1.98\) and the weight update is \(-1.98 * 0.0001 = -0.000198\). After updating, the new weight becomes \(0.999602\). But in the next iteration the weight gets cast down to \(0.99\) again! So it seems like the low-precision weights are stuck.</p>

<p>However, we did get a nonzero gradient in the previous iteration, namely \(-0.000198\), and the weight is continually being updated in the master copy. This means that after more iterations, the master copy will eventually drop to \(0.98...\); the “float2” version of the weight will then change to \(0.98\), then \(0.97\), and so on, as \(w\) approaches the global minimum of \(f\) at \(w = 0\).</p>
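<p>The toy example above can be simulated in a few lines: rounding to 2 decimal places plays the role of the low-precision format, with a full-precision master copy kept alongside (a hypothetical sketch using round-to-nearest rather than the round-down convention above):</p>

```python
def low(w: float) -> float:
    """Cast down to our toy low-precision format: 2 decimal places."""
    return round(w, 2)

lr = 0.0001

# Without a master copy the weight stalls: every tiny update rounds away.
w = 1.0
for _ in range(100):
    w = low(w - lr * 2 * w)       # gradient of w**2 is 2w
print(w)                          # 1.0 -- stuck

# With a full-precision master copy the tiny updates accumulate.
master = 1.0
for _ in range(100):
    w = low(master)               # cast down before the forward/backward pass
    master -= lr * 2 * w          # update applied to the master copy
print(low(master))                # below 1.0 -- the weight keeps descending
```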

<h2 id="trick-2-loss-scaling">Trick 2: Loss Scaling</h2>
<p>We briefly mentioned that most of the activation gradients during training are too small to be represented in float16. However, it is rare to see huge gradients and loss values (on the order of \(&gt; 2^{12}\)). This means that we can preserve all the information of a float32 gradient in float16 format simply by <strong>shifting bits</strong>!</p>

<p>The procedure is as follows:</p>
<ol>
  <li>Compute the loss function in float32.</li>
  <li>Scale the loss by a power of 2 (a bit shift upward) and convert it to float16.</li>
  <li>Backpropagate with the scaled float16 loss; by the rules of differentiation, the constant scaling factor is carried through to every gradient.</li>
  <li>Convert the gradients back to float32 and shift the bits back down (unscale).</li>
  <li>Do gradient clipping, weight decay, etc. (optional)</li>
</ol>

<p>The final gradient will be in float32 after this procedure, but we have saved a lot of computation time during the computationally intensive backpropagation phase. Moreover, we don’t lose any precision by doing this!</p>

<p>How is the scaling factor determined? A very simple trial-and-error approach: we start with the largest possible factor \(2^{15}\) and check whether the result overflows; if it does, we halve the scaling factor, and so on until overflow no longer happens. The authors of the paper found that a scaling factor of \(2^3\) (3 bits) is good enough for most purposes.</p>
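<p>The scaling round trip can be sketched with the standard library’s half-precision format code (a toy demonstration; the scale factor \(2^{13}\) here is an arbitrary choice for illustration, not the paper’s):</p>

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE-754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8                        # a typical tiny activation gradient
print(to_fp16(grad))               # 0.0 -- lost entirely in float16

scale = 2.0 ** 13                  # loss-scaling factor (a bit shift)
scaled = to_fp16(grad * scale)     # about 8.19e-05, representable in float16
recovered = scaled / scale         # unscale back in full precision
print(recovered)                   # ~1e-8 -- the information survived
```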

<h2 id="trick-3-arithmetic-precision">Trick 3: Arithmetic Precision</h2>
<p>In the training of neural networks, we often need to multiply matrices. When multiplying two matrices \(A_{m \times n} \cdot B_{n \times k}\), the \((i, j)\)-th entry of their product is given by
\(\displaystyle\sum\limits_{\ell=1}^n a_{i, \ell} b_{\ell, j}\).
Notice that such a computation consists of a series of scalar products added together.</p>

<p>As we have seen earlier, multiplying two float16 numbers naturally produces a result with more digits. The strategy in the paper is to store the matrix entries in float16, keep their products in float32, carry out all the addition operations in float32, and only then cast the result back to float16.</p>

<p>To see the upshot of this approach, let us again consider an example in base-10. We want to compute the quantity \(1.23 * 0.45 + 2.12 * (-5.94)\). These scalars are stored in “float2” at present. The exact result is \(-12.0393\), which has 4 decimal places.</p>

<p>Let us first carry out the usual computation in “float2”: multiply into a “float4” number and then immediately cast down. We get \(0.5535 + (-12.5928)\), each of which gets cast down (rounding down), giving \(0.55 + (-12.60) = -12.05\).</p>

<p>Now let us accumulate the sums in “float4” before we cast back down. By using the round-down convention, we get \(-12.04\) when we convert back to “float2”. This means that accumulating the sums before casting back down preserves higher precision. This effect becomes compounded when the matrices are very large, resulting in more addition operations.</p>
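<p>The same accumulate-then-cast effect shows up with real float16 (again emulated with the standard library’s half-precision format code), here on a plain sum, one of the reduce operations mentioned earlier (a toy example with made-up numbers):</p>

```python
import struct

def fp16(x: float) -> float:
    """Round a float to the nearest IEEE-754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

terms = [256.0] + [0.01] * 1000   # exact sum is 266.0

# Cast down after every single addition: the small terms are absorbed,
# because near 256 consecutive float16 values are 0.25 apart.
acc16 = 0.0
for t in terms:
    acc16 = fp16(acc16 + fp16(t))
print(acc16)                      # 256.0 -- every 0.01 was rounded away

# Accumulate in full precision and cast once at the end:
acc32 = fp16(sum(fp16(t) for t in terms))
print(acc32)                      # 266.0
```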

<h1 id="conclusion">Conclusion</h1>

<p>The authors note that training models with mixed precision achieves accuracy comparable to training in float32, with the added benefit of saving computation time for many models.</p>

<p>In the next post in this series, we will talk about how to implement mixed precision training using Pytorch.</p>

<p><strong>(*)</strong> It should be noted that the base-10 examples are purely for demonstration, and that the terms “float2”, “float4”, and so forth are meaningless for the floating-point arithmetic of most CPU architectures, which work in binary. And of course, float32 does not mean that every number carries exactly 32 digits of precision, since values can go as low as \(2^{-149}\).</p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><em>Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G. &amp; Wu, H.</em> <strong>Mixed Precision Training</strong> (2017) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://en.wikipedia.org/wiki/Single-precision_floating-point_format">https://en.wikipedia.org/wiki/Single-precision_floating-point_format</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="machine_learning" /><category term="hardware" /><summary type="html"><![CDATA[When training your deep learning models, one major concern is training time. Today we summarize a method which leverages your hardware when training models with a large number of parameters, in order to achieve a speed boost. This approach is called mixed precision training and is based on the seminal joint paper Mixed Precision Training1 by Baidu and NVIDIA researchers. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G. &amp; Wu, H. Mixed Precision Training (2017) &#8617;]]></summary></entry></feed>