6

I have a theano symbolic matrix

x = T.fmatrix('input')

x will be later on populated by n vectors of dim d (at train time).

I would like to have the theano equivalent of pdist (scipy.spatial.distance.pdist of pdist), something like

D = theano.pdist( x )

How can I achieve this?

Calling scipy.spatial.distance.pdist on x directly does not work as x at this stage is only symbolic...

Update: I would very much like to be able to mimic pdist "compact" behavior: that is, computing only ~1/2 of the nxn entries of the distance matrix.

Shai
  • 111,146
  • 38
  • 238
  • 371
  • Are you looking for the full generality of pdist, or are you interested in a specific instance, e.g. euclidean distances? – eickenberg Sep 26 '14 at 17:07
  • @eickenberg euclidean distance would be a nice start. I believe once I have this implemented I'll be able to generalize it to other metrics. – Shai Sep 28 '14 at 05:46

2 Answers2

14

pdist from scipy is a collection of different functions - there doesn't exist a Theano equivalent for all of them at once. However, each specific distance, being a closed form mathematical expression, can be written down in Theano as such and then compiled.

Take as a example the minkowski p norm distance (copy+pasteable):

import theano
import theano.tensor as T
X = T.fmatrix('X')
Y = T.fmatrix('Y')
P = T.scalar('P')
translation_vectors = X.reshape((X.shape[0], 1, -1)) - Y.reshape((1, Y.shape[0], -1))
minkowski_distances = (abs(translation_vectors) ** P).sum(2) ** (1. / P)
f_minkowski = theano.function([X, Y, P], minkowski_distances)

Note that abs calls the built-in __abs__, so abs is also a theano function. We can now compare this to pdist:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
d = 20 # dimension
nX = 10
nY = 30
x = rng.randn(nX, d).astype(np.float32)
y = rng.randn(nY, d).astype(np.float32)

ps = [1., 3., 2.]

for p in ps:
    d_theano = f_minkowski(x, x, p)[np.triu_indices(nX, 1)]
    d_scipy = pdist(x, p=p, metric='minkowski')
    print "Testing p=%1.2f, discrepancy %1.3e" % (p, np.sqrt(((d_theano - d_scipy) ** 2).sum()))

This yields

Testing p=1.00, discrepancy 1.322e-06
Testing p=3.00, discrepancy 4.277e-07
Testing p=2.00, discrepancy 4.789e-07

As you can see, the correspondence is there, but the function f_minkowski is slightly more general, since it compares the lines of two possibly different arrays. If twice the same array is passed as input, f_minkowski returns a matrix, whereas pdist returns a list without redundancy. If this behaviour is desired, it can also be implemented fully dynamically, but I will stick to the general case here.

One possibility of specialization should be noted though: In the case of p=2, the calculations become simpler through the binomial formula, and this can be used to save precious space in memory: Whereas the general Minkowski distance, as implemented above, creates a 3D array (due to avoidance of for-loops and summing cumulatively), which is prohibitive, depending on the dimension d (and nX, nY), for p=2 we can write

squared_euclidean_distances = (X ** 2).sum(1).reshape((X.shape[0], 1)) + (Y ** 2).sum(1).reshape((1, Y.shape[0])) - 2 * X.dot(Y.T)
f_euclidean = theano.function([X, Y], T.sqrt(squared_euclidean_distances))

which only uses O(nX * nY) space instead of O(nX * nY * d) We check for correspondence, this time on the general problem:

d_eucl = f_euclidean(x, y)
d_minkowski2 = f_minkowski(x, y, 2.)
print "Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy %1.3e" % ((d_eucl - d_minkowski2) ** 2).sum()

yielding

Comparing f_minkowski, p=2 and f_euclidean: l2-discrepancy 1.464e-11
eickenberg
  • 14,152
  • 1
  • 48
  • 52
  • (If all this is totally beside your usecase, please let me know :)) – eickenberg Sep 26 '14 at 17:56
  • 1
    Thank you for the detailed answer. I am more interested in the case of getting only the list of `pdist`, i.e., the more compact mode where `x` and `y` inputs are actually the same and ~1/2 the calculations are not carried out. – Shai Sep 28 '14 at 05:52
  • OK, it should be possible to extract the upper triangle, and possibly the theano compiler will already understand and optimize, but I am not certain. I wrote something to go from an upper triangle list to a full matrix [here](http://stackoverflow.com/questions/25326462/initializing-a-symmetric-theano-dmatrix-from-its-upper-triangle/25334094#25334094) and one should be able to do the inverse. I'll check that. – eickenberg Sep 28 '14 at 08:56
  • @eickenberg I'm using your `squared_euclidean_distances` implementation, but it has some negative values and `sqrt` contains some `NaN`s. – Vektor88 Aug 10 '15 at 15:10
  • Yes, this can happen. Floating point computations yield different results according to the chosen factorization of the polynomial. It also happens in sklearn, which uses the same speedup. They solve it by clipping towards 0, which makes sense, so you could use `T.maximum(squared_..., 0.)` before the square root. – eickenberg Aug 10 '15 at 22:34
  • 1
    Also, be careful as the gradient of the square root becomes undefined near zero. I therefore suggest clipping not at zero but at a slightly higher value, or stopping the gradient using `T.zero_grad`. – eflorico Apr 26 '18 at 23:39
  • Yup, or by using the correct gradient descent step for these situations (an implicit one, given by the proximal operator) – eickenberg Apr 27 '18 at 19:10
-2

I haven't worked with Theano before, but here is a solution based on pure Numpy functions (perhaps you convert it to the equivalent theano functions. Note that I'm using automatic broadcasting in the expression below, so you might have to rewrite that explicitly if Theano doesn't support it):

# X is an m-by-n matrix (rows are examples, columns are dimensions)
# D is an m-by-m symmetric matrix of pairwise Euclidean distances
a = np.sum(X**2, axis=1)
D = np.sqrt((a + a[np.newaxis].T) - 2*np.dot(X, X.T))

It is based on the fact that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v. (I showed this in previous answers of mine using MATLAB)

Here is a comparison against Scipy existing functions:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def my_pdist(X):
    a = np.sum(X**2, axis=1)
    D = np.sqrt((a + a[np.newaxis].T) - 2*np.dot(X, X.T))
    return D

def scipy_pdist(X):
    D = squareform(pdist(X, metric='euclidean'))
    return D    

X = np.random.rand(5, 3)
D1 = my_pdist(X)
D2 = scipy_pdist(X)

The difference should be negligible, close to machine epsilon (np.spacing(1)):

>>> np.linalg.norm(D1-D2)
8.5368137554718277e-16

HTH


EDIT:

Here is another implementation with a single loop:

def my_pdist_compact(X):
    D = np.empty(shape=[0,0], dtype=X.dtype)
    for i in range(X.shape[0]-1):
        D = np.append(D, np.sqrt(np.sum((X[i,] - X[i+1:,])**2, axis=1)))
    return D

Somewhat equivalent MATLAB code:

function D = my_pdist_compact(X)
    n = size(X,1);
    D = cell(n-1,1);
    for i=1:n-1
        D{i} = sqrt(sum(bsxfun(@minus, X(i,:), X(i+1:end,:)).^2, 2));
    end
    D = vertcat(D{:});
end

This returns the pairwise-distances in compact form (upper triangular part of the symmetric matrix). This is the same output as pdist. Use squareform to convert it to full matrix.

>>> d1 = my_pdist_compact(X)
>>> d2 = pdist(X)    # from scipy.spatial.distance
>>> (d1 == d2).all()
True

I will leave it to you to see if it's possible to write the equivalent loop using Theano (see theano.scan)!

Community
  • 1
  • 1
Amro
  • 123,847
  • 25
  • 243
  • 454
  • Thanks! However, I am looking for a way to save me the `squareform` result and get the compact result directly. – Shai Sep 21 '14 at 14:47
  • Moreover, I am looking for a more "theano"-ish way of doing this, as it may be easier for the compiler to optimize... – Shai Sep 21 '14 at 14:48
  • ignore me if this wrong (like I said I've never used Theano), but can't you just implement pdist using regular loops (you can iterate over rows or even over columns). From what I understand, the code is compiled into to C code, so the overhead of loops of an interpreted language is gone.. And if its clever enough, it should be able to make calls to vectorized functions from BLAS. – Amro Sep 21 '14 at 14:58
  • I'm afraid my theano skills are too basic for constructing symbolic nested loops... – Shai Sep 21 '14 at 15:09
  • @Shai: it doesn't have to be a nested loop, you can write in a single loop, which I think is possible to implement with `theano.scan` (see my edit) – Amro Sep 21 '14 at 16:11
  • the issue here is that in its symbolic form I do not have access to `x.shape`. The actual number of data points is determined after model construction... I need somehow to define `pdist` when I only have an abstract symbolic variable `x`. Am I making any sense here? – Shai Sep 21 '14 at 16:16
  • Theano can be misused in this way to treat it like a deferred computation engine, which is what the OP wants to do. It seems that the op wants the flexibility of declaring a dynamic tensor, but also wants Theano's optimizing compiler to "just know" how to apply an arbitrary scipy function, `pdist` to that dynamic data. This is the sort of task that numba is meant for, not Theano (which is meant for people to optimize known-in-advance computations). Unless you implement `pdist` natively in Theano, then optimizing "eventual execution of `pdist` on some data" is inappropriate. – ely Sep 21 '14 at 16:21
  • In other words, you need a pre-Theano-ized version of `pdist` if you want it to arbitrarily optimize that execution on a Theano tensor. Wanting to effectively "call" `pdist` on a Theano tensor represents a misunderstanding of what Theano is trying to do. Whereas, if you took the code from this answer and effectively used numba's jit decorator to compile it, you'd have a compiled version of the function you want to call. I hope the difference is clear. Numba is more meant to go from arbitrary function -> compiled version of that function (often with type/size annotation), Theano is not. – ely Sep 21 '14 at 16:26
  • I see, thanks for the clarification! I think I still have some reading to do on Theano :) – Amro Sep 21 '14 at 16:27
  • 3
    Theano is more like SymPy in this regard: it provides you with a catalog of functions it knows how to interpret, and you can symbolically play with all sorts of expressions that are accessible through combinations of that catalog. But it's not good at absorbing things outside of the catalog. A good rule of thumb when using Theano is this: Ask yourself, "Would I expect SymPy to handle this sort of calculation for me?" If the answer is no, you're probably (not always) using Theano inappropriately. That's not to say you can't make it work. Just that there are more appropriate approaches. – ely Sep 21 '14 at 16:28
  • I think I get it; it's likes a CAS engine that is able to parse/manipulate mathematical expressions and generate optimized C code to evaluate it. My experience in this is limited to using [Cython](http://jakevdp.github.io/blog/2012/08/24/numba-vs-cython/) for speeding this sort of calculations. Numba from what I understand is another way of doing the same thing, but using decorators instead of the explicit Cython syntax, right? – Amro Sep 21 '14 at 16:33
  • That's a good description. Yes, Theano is a way to deal with expressions symbolically before having the rubber-meets-the-road execution of the expression happen (and that execution could be transpiled to execute on different hardware, across a web service, whatever). Numba is very similar, except that it doesn't seek to expose a symbolic interface. Rather, it lets you declare which functions are to be JIT compiled, but then use them just as you would any other Python function. – ely Sep 21 '14 at 16:38
  • So, for example, once you "go Numba" with some code, you can just write Python code around it. But once you "go Theano" you are thus committed to only staying in the symbolic, Theano world for as long as you want to deal with those variables and constructs. Which is exactly what makes Theano great for targeting a GPU with pre-existing math functions, like taking a gradient or calculating a polynomial, but what makes it ill-suited for mixing in your own arbitrary code (or arbitrary library code) into that deferred-symbolic pool-o-computations. – ely Sep 21 '14 at 16:39