13

I would like to compute the Earth Mover Distance between two 2D arrays (these are not images).

Right now I go through two libraries: scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html) and pyemd (https://pypi.org/project/pyemd/).

#define a sampeling method
def sampeling2D(n, mu1, std1, mu2, std2):
   #sample from N(0, 1) in the 2D hyperspace
   x = np.random.randn(n, 2)

   #scale N(0, 1) -> N(mu, std)
   x[:,0] = (x[:,0]*std1) + mu1
   x[:,1] = (x[:,1]*std2) + mu2

   return x

#generate two sets
Y1 = sampeling2D(1000, 0, 1, 0, 1)
Y2 = sampeling2D(1000, -1, 1, -1, 1)

#compute the distance
distance = pyemd.emd_samples(Y1, Y2)

While the scipy version doesn't accept 2D arrays and it returns an error, the pyemd method returns a value. If you see from the documentation, it says that it accept only 1D arrays, so I think that the output is wrong. How can I calculate this distance in this case?

fuglede
  • 17,388
  • 2
  • 54
  • 99
Luca
  • 324
  • 2
  • 15

1 Answers1

21

So if I understand you correctly, you're trying to transport the sampling distribution, i.e. calculate the distance for a setup where all clusters have weight 1. In general, you can treat the calculation of the EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem: Your two arrays are the partitions in a bipartite graph, and the weights between two vertices are your distance of choice. Assuming that you want to use the Euclidean norm as your metric, the weights of the edges, i.e. the ground distances, may be obtained using scipy.spatial.distance.cdist, and in fact SciPy provides a solver for the linear sum assignment problem as well in scipy.optimize.linear_sum_assignment (which recently saw huge performance improvements which are available in SciPy 1.4. This could be of interest to you, should you run into performance problems; the 1.3 implementation is a bit slow for 1000x1000 inputs).

In other words, what you want to do boils down to

from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

d = cdist(Y1, Y2)
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)

It is also possible to use scipy.sparse.csgraph.min_weight_bipartite_full_matching as a drop-in replacement for linear_sum_assignment; while made for sparse inputs (which yours certainly isn't), it might provide performance improvements in some situations.

It might be instructive to verify that the result of this calculation matches what you would get from a minimum cost flow solver; one such solver is available in NetworkX, where we can construct the graph by hand:

import networkx as nx

G = nx.DiGraph()

# Represent elements in Y1 by 0, ..., 999, and elements in
# Y2 by 1000, ..., 1999.
for i in range(n):
    G.add_node(i, demand=-1)
    G.add_node(n + i, demand=1)

for i in range(n):
    for j in range(n):
        G.add_edge(i, n + j, capacity=1, weight=d[i, j])

At this point, we can verify that the approach above agrees with the minimum cost flow:

In [16]: d[assignment].sum() == nx.algorithms.min_cost_flow_cost(G)
Out[16]: True

Similarly, it's instructive to see that the result agrees with scipy.stats.wasserstein_distance for 1-dimensional inputs:

from scipy.stats import wasserstein_distance

np.random.seed(0)
n = 100

Y1 = np.random.randn(n)
Y2 = np.random.randn(n) - 2
d =  np.abs(Y1 - Y2.reshape((n, 1)))

assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)       # 1.9777950447866477
print(wasserstein_distance(Y1, Y2))  # 1.977795044786648
fuglede
  • 17,388
  • 2
  • 54
  • 99
  • 1
    I actually really like your problem re-formulation. This can be used for a limit number of samples, but it work. Thanks!! – Luca Aug 20 '19 at 20:43
  • Great, you're welcome. If the answer is useful, you can mark it as [accepted](https://stackoverflow.com/help/accepted-answer). And for the record, expect the 1000x1000 case it to take less than a second with the 1.4 implementation. – fuglede Aug 21 '19 at 18:54
  • Sorry, I thought that I accepted it. May I ask you which version of scipy are you using? Because I am working on Google Colaboratory, and using the last version "Version: 1.3.1". However, it still "slow", so I can't go over 1000 of samples. – Luca Aug 26 '19 at 13:38
  • 1
    Yes, 1.3.1 is the latest official release; you can pick up a pre-release of 1.4 from https://github.com/scipy/scipy (or just wait until it's officially available). Alternatively, if you don't mind building a package yourself, several efficient implementations of solvers for the problem can be found. For instance, the one available at https://gist.github.com/kylemcdonald/3dcce059060dbd50967970905cf54cd9 solves a 1000x1000 case in about 50 ms, where it would take 100 seconds in SciPy 1.3.1. See the comments to the gist for usage instructions. – fuglede Aug 26 '19 at 14:24
  • For a comparison of that one in particular with the implementation in SciPy, see https://nbviewer.jupyter.org/urls/gist.githubusercontent.com/fuglede/d7bb1dcdf23669eacc7eef3256198789/raw/04d3551643f98c642589cf85dd9319c9605697f8/lap-benchmark.ipynb -- there are also some useful benchmarks in https://github.com/scipy/scipy/pull/9800 and https://github.com/scipy/scipy/pull/10296 – fuglede Aug 26 '19 at 21:01
  • And SciPy 1.4 was released in December 2019, so those improvements are now generally available. – fuglede Feb 13 '20 at 17:19
  • "EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem" This doesn't sound right. In the Earth Mover setup, you're allowed to distribute each input point to multiple output points. This is not the case for linear sum assignment. It's very interesting that in the 1D case each input point gets assigned to exactly one output point (empirically), but it's definitely not obvious (to me) this should be the case in general. – Alex Eftimiades Aug 28 '20 at 19:53
  • 2
    @AlexEftimiades: Are you happy with the minimum cost flow formulation? If so, the integrality theorem for min-cost flow problems tells us that since all demands are integral (1), there is a solution with integral flow along each edge (hence 0 or 1), which in turn is exactly an assignment. – fuglede Aug 28 '20 at 20:44