I am trying to understand the implementation used in scipy.stats.wasserstein_distance. For p=1 and no weights, with u_values and v_values the two 1-D distributions (samples), the code comes down to:
```python
u_sorter = np.argsort(u_values)                                            # (1)
v_sorter = np.argsort(v_values)

all_values = np.concatenate((u_values, v_values))                          # (2)
all_values.sort(kind='mergesort')

deltas = np.diff(all_values)                                               # (3)

u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')  # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')

v_cdf = v_cdf_indices / v_values.size                                      # (5)
u_cdf = u_cdf_indices / u_values.size

return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))                  # (6)
```
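For reference, here is the snippet wrapped into a self-contained, runnable function so it can be experimented with directly. The function name and the sample inputs are my own; everything else follows the code above.

```python
import numpy as np

def wasserstein_1d(u_values, v_values):
    """1-D Wasserstein distance for p=1, no weights (the snippet above,
    wrapped into a function; variable names follow the question)."""
    u_values = np.asarray(u_values, dtype=float)
    v_values = np.asarray(v_values, dtype=float)
    u_sorter = np.argsort(u_values)                                            # (1)
    v_sorter = np.argsort(v_values)
    all_values = np.concatenate((u_values, v_values))                          # (2)
    all_values.sort(kind='mergesort')
    deltas = np.diff(all_values)                                               # (3)
    u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')  # (4)
    v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
    v_cdf = v_cdf_indices / v_values.size                                      # (5)
    u_cdf = u_cdf_indices / u_values.size
    return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))                  # (6)

# Hypothetical sample data: point masses at 0, 1, 3 vs. two points at 2.
d = wasserstein_1d([3, 0, 1], [2, 2])  # expected 4/3 by hand calculation
```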
What is the reasoning behind this implementation? Is there some literature? I did look at the paper cited, which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral

\int_{-\infty}^{+\infty} |U(x) - V(x)| \, dx,

with U and V the cumulative distribution functions of the distributions u_values and v_values, but I don't understand how this integral is evaluated in the scipy implementation.
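To make the question concrete, here is my own sanity check (not part of scipy): approximating the integral of |U - V| with a plain Riemann sum over a fine grid, where U and V are the empirical CDFs of made-up sample data, and comparing against the breakpoint sum from the snippet above.

```python
import numpy as np

# Hypothetical sample data (my own choice, for illustration only).
u = np.array([3.0, 0.0, 1.0])
v = np.array([2.0, 2.0])

# Brute force: left Riemann sum of |U(x) - V(x)| on a fine grid
# covering both supports.  searchsorted(..., 'right') / n is the
# empirical CDF evaluated at the grid points.
xs = np.linspace(-1.0, 4.0, 500001)
U = np.searchsorted(np.sort(u), xs, side='right') / u.size
V = np.searchsorted(np.sort(v), xs, side='right') / v.size
approx = np.sum(np.abs(U - V)[:-1] * np.diff(xs))

# Breakpoint sum, i.e. the same computation as the snippet in the question.
all_values = np.sort(np.concatenate((u, v)))
deltas = np.diff(all_values)
u_cdf = np.sort(u).searchsorted(all_values[:-1], 'right') / u.size
v_cdf = np.sort(v).searchsorted(all_values[:-1], 'right') / v.size
exact = np.sum(np.abs(u_cdf - v_cdf) * deltas)
```

The two values agree to within the grid resolution, so the snippet does evaluate that integral; what I am asking about is why.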
In particular,

a) why are they multiplying by deltas in (6) to solve the integral?

b) how are u_cdf and v_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distributions u_values and v_values is not preserved. Shouldn't it be, under the general definition of the Wasserstein distance?
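To illustrate what I mean about ordering, I checked numerically that shuffling the inputs does not change the result (the helper below just repeats the computation from the snippet; the random data are made up):

```python
import numpy as np

def w1(u_values, v_values):
    # Same computation as the snippet in the question (p=1, no weights).
    u_values = np.asarray(u_values, dtype=float)
    v_values = np.asarray(v_values, dtype=float)
    all_values = np.sort(np.concatenate((u_values, v_values)))
    deltas = np.diff(all_values)
    u_cdf = np.sort(u_values).searchsorted(all_values[:-1], 'right') / u_values.size
    v_cdf = np.sort(v_values).searchsorted(all_values[:-1], 'right') / v_values.size
    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

rng = np.random.default_rng(0)
u = rng.normal(size=50)
v = rng.normal(loc=1.0, size=40)

d1 = w1(u, v)
d2 = w1(rng.permutation(u), rng.permutation(v))  # shuffled inputs, same value
```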
Thank you for your help!