
I have a big matrix with millions of rows and hundreds of columns. The first n rows (about 100K) are reference rows; for each of the other rows, I would like to find the k (about 10) closest neighbours among the reference vectors using scipy's cdist.

I created a multiprocessing.sharedctypes.Array from the matrix, and I use asarray and slicing to split up the matrix and compute the distances with cdist.

My current code looks like this:

import numpy
from multiprocessing import Pool, sharedctypes
from scipy.spatial.distance import cdist

shared_m = None

def generate_sample():
    m = numpy.random.random((20000, 640))
    shape = m.shape
    global shared_m
    # flat shared buffer; lock=False because the workers only read it
    shared_m = sharedctypes.Array('d', m.flatten(), lock=False)
    return shape

def dist(A, B, metric):
    return cdist(A, B, metric=metric)

def get_closest(args):
    shape, lenA, start_B, end_B, metric, numres = args
    # rebuild a numpy view over the shared buffer (no copy) and restore the shape
    m = numpy.ctypeslib.as_array(shared_m)
    m.shape = shape
    A = numpy.asarray(m[:lenA,:], dtype=numpy.double)           # reference rows
    B = numpy.asarray(m[start_B:end_B,:], dtype=numpy.double)   # this worker's slice
    distances = dist(B, A, metric)
    # rest of code to find closests

def p_get_closest(shape, lenA, lenB, metric="cosine", sample_size=1000, numres=10):
    p = Pool(4)
    # each task gets one sample_size-row slice of the non-reference rows
    args = ((shape, lenA, lenA + i * sample_size, lenA + (i + 1) * sample_size, metric, numres)
            for i in xrange(lenB / sample_size))
    res = p.map_async(get_closest, args)
    return res.get()

def main():
    shape = generate_sample()
    p_get_closest(shape, 5000, shape[0] - 5000, "cosine", 3000, 10)

if __name__ == "__main__":
    main()
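
The code that actually finds the closest rows was removed above. For illustration only (this is not the removed code, and `k_closest` is a hypothetical helper name), one common way to pick the `numres` smallest distances in each row of the cdist result is `numpy.argpartition`:

import numpy

def k_closest(distances, numres):
    # distances: (len(B), len(A)) array as returned by cdist(B, A)
    # unsorted column indices of the numres smallest values in each row
    idx = numpy.argpartition(distances, numres, axis=1)[:, :numres]
    rows = numpy.arange(distances.shape[0])[:, None]
    # sort the numres candidates of each row by their actual distance
    order = numpy.argsort(distances[rows, idx], axis=1)
    return idx[rows, order]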

My problem right now is that the parallel calls of cdist somehow block each other. (Maybe "block" is not the right expression; the point is that there are no parallel cdist computations.)

I tried to trace the problem back with printouts in scipy/spatial/distance.py and scipy/spatial/src/distance.c to understand where the run blocks. It looks like there is no copying of data; the dtype argument took care of that.

Putting a printf into distance.c:cdist_cosine() shows that all the processes reach the point where the actual computation starts (right before the for loops), but the computations don't run in parallel.

I tried a lot of things, like using multiprocessing.sharedctypes.RawArray instead of Array, and using lock=True while creating the Array.

I have no idea what I did wrong or how to investigate the problem further.
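
As an additional sanity check (not part of the original post), here is a minimal standalone script that times one cdist call against four calls submitted through a Pool; the array sizes roughly mirror one task above. If the workers really run concurrently, the pooled time should be well below four times the single-call time:

import time
import numpy
from multiprocessing import Pool
from scipy.spatial.distance import cdist

A = numpy.random.random((5000, 640))
B = numpy.random.random((3000, 640))

def one_call(_):
    # roughly the same amount of work as one task in p_get_closest
    return cdist(B, A, metric="cosine").shape

if __name__ == "__main__":
    start = time.time()
    one_call(None)
    single = time.time() - start

    pool = Pool(4)
    start = time.time()
    pool.map(one_call, range(4))
    pooled = time.time() - start

    print("single call: %.2fs, 4 pooled calls: %.2fs" % (single, pooled))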

zseder
  • Your example won't work properly: you're assigning a value to `shared_m` in `generate_sample()` without using `global shared_m`. So `shared_m` is just a local variable, and ends up being `None` inside `get_closest`, which raises an exception. Also, setting it as a global variable like this probably won't work properly on Windows, either, because of the lack of `os.fork()`. Not sure if that's important to you. – dano Aug 04 '14 at 14:08
  • Yes, it was just a copy-paste error because I removed a lot of lines that were not needed, but I ran it with `global`. And running on Windows is not important for me at all. – zseder Aug 04 '14 at 15:25
  • If you're only interested in the distances to the *k* nearest neighbours of your query vector, computing the distances to *every other vector in your dataset* is a very inefficient way to go about this. Consider using a [`scipy.spatial.cKDtree`](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.cKDTree.html) instead. Also, 260 dimensions is very high - you might find some of the answers to [this question](http://stackoverflow.com/q/5751114/1461210) helpful. – ali_m Aug 11 '14 at 14:37
  • Yes, I read about `cKDtree`; the warning there about the dimensions made me search for something else, or for a way to implement it in parallel. Or, if there are no other options, I'll try a heuristic using cKDtree, using only the first 20 dimensions for prefiltering, and then computing the closest only among the closest 100 or something like that. Thank you for the other thread, I'll look into that! – zseder Aug 11 '14 at 14:48
  • I also came across [`KGraph`](http://www.kgraph.org), which implements a very promising-looking heuristic algorithm for finding approximate nearest neighbours in very high-dimensional spaces (it was designed for image similarity searches). However, if you want a distance metric other than euclidean then you'll have to write some C++! – ali_m Aug 11 '14 at 16:09
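
For reference, a minimal sketch of the cKDTree approach suggested in the comments above (illustrative only, not from the thread; the array sizes match the generated sample). cKDTree only supports Minkowski metrics, so to approximate the cosine case the rows are L2-normalised first: on unit-length vectors, the euclidean and cosine nearest-neighbour rankings agree. The warning about high dimensionality still applies.

import numpy
from scipy.spatial import cKDTree

m = numpy.random.random((20000, 640))
m /= numpy.linalg.norm(m, axis=1)[:, numpy.newaxis]   # make every row unit length

lenA = 5000
tree = cKDTree(m[:lenA])                    # index built on the reference rows
dists, idx = tree.query(m[lenA:], k=10)     # 10 nearest reference rows per query row
print(idx.shape)                            # (15000, 10)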

0 Answers