Given two large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that, for each element of "source", returns the index of the closest element in "destination", with this limitation: I can only use numpy... so no scipy, pandas, numexpr, cython...

To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.

import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations-point) # delta between destinations and given point
        d = inner1d(dlt,dlt) # get distances
        return np.argmin(d) # return closest index

    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)


def unthreaded(sources, destinations):
    results = []
    for point in sources:
        dlt = (destinations-point) # delta between destinations and given point
        d = inner1d(dlt,dlt) # get distances
        results.append(np.argmin(d)) # append closest index

    return results


# Setup the data
n_destinations = 10000 # 10k random destinations
n_sources = 10000      # 10k random sources
destinations= np.random.rand(n_destinations,3) * 100
sources = np.random.rand(n_sources,3) * 100

#Compare!
print 'threaded:   %s'%timeit.Timer(lambda: threaded(sources,destinations)).repeat(1,1)[0]
print 'unthreaded: %s'%timeit.Timer(lambda: unthreaded(sources,destinations)).repeat(1,1)[0]

Results:

threaded:   0.894030461056
unthreaded: 1.97295164054

Multithreading seems beneficial, but I was hoping for more than a 2x speedup, given that the real-life datasets I deal with are much larger.

All recommendations to improve performance (within the limitations described above) will be greatly appreciated!

Fnord
  • Before trying to add horsepower by means of multithreading, I would start with better algorithms... Currently your brute force approach is `O(N*M)`, while, after preparing a KD tree of the target points, you could make it `O(N*log(M))`. – Matteo Italia Aug 05 '15 at 23:37
  • Why are you committed to using only numpy? – ali_m Aug 06 '15 at 00:32
  • @ali_m I have to use a custom build of python called mayapy, Autodesk's implementation of python inside their 3D software (Maya). It's compiled against VC2010 64bit, so pre-built binaries such as [Christoph Gohlke's](http://www.lfd.uci.edu/~gohlke/pythonlibs/) and others aren't compatible. I was able to build numpy from scratch but failed to build scipy, numexpr, etc. – Fnord Aug 06 '15 at 04:45
  • @MatteoItalia I did look for pure python KD tree implementations, notably I found [this one](https://github.com/storpipfugl/pykdtree) and [this one](https://code.google.com/p/python-kdtree/). Both were vastly slower (like 10x) than my unthreaded numpy test. So I'm afraid that without access to scipy (as stated above) KD trees are a dead end. – Fnord Aug 06 '15 at 05:54
  • Just for reference, on my puny dual core laptop `cKDTree` takes about 17ms to both construct the tree *and* query it using your example data, compared with 1.53s for `threaded` and 1.39s for `unthreaded`. I am absolutely convinced that k-D trees are worth pursuing. I'm very surprised that you saw such slow query times for `pykdtree`, which seems to be implemented in C/Cython. – ali_m Aug 06 '15 at 20:14
  • @ali_m I know cKDTree would be a huge speed boost, but it's part of scipy, which is unavailable to me. I took a look at pykdtree; it looks like it also needs to be built from scratch due to numpy dependencies. I tried to build both scipy and pykdtree and failed in both instances. Because I'm forced to use a non-standard distribution of python, I can't rely on normal package distributions or pre-built binaries. Numpy was fairly easy to build, but scipy and others haven't played nice so far. But thanks for your suggestion! – Fnord Aug 06 '15 at 21:38
  • I see - in your previous comment you said you had tried `pykdtree` and found that it was too slow. Yes, you would need to compile the C extension for `pykdtree`. – ali_m Aug 06 '15 at 21:45
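For reference, the `cKDTree` comparison ali_m describes above would look roughly like this (a minimal sketch, assuming scipy were available, which it isn't under mayapy):

import numpy as np
from scipy.spatial import cKDTree

destinations = np.random.rand(10000, 3) * 100
sources = np.random.rand(10000, 3) * 100

tree = cKDTree(destinations)          # build the tree once over the destinations
dists, indices = tree.query(sources)  # index of the nearest destination for every source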

1 Answer


OK, I've been reading the Maya documentation on Python, and I came to these conclusions/guesses:

  • They're probably using CPython inside (there are several references to that documentation and not to any other).
  • They're not fond of threads (lots of non-thread-safe methods).

Given the above, I'd say it's better to avoid threads. The GIL makes this a common problem, and there are several ways to work around it.

  • Try building a C/C++ extension and do the threading inside the C/C++ code, where the GIL is not an issue. Personally, I'd only try to get SIP working, and then move on.
  • Use multiprocessing. Even if your custom python distribution doesn't include it, you can get to a working version since it's all pure python code. multiprocessing is not affected by the GIL since it spawns separate processes (see the sketch after this list).
  • One of the above should work out for you. If not, try another parallel tool (after some serious praying).
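Here's a minimal sketch of the multiprocessing route applied to your brute-force search. The chunking scheme and process count are illustrative only, and np.einsum stands in for inner1d so the worker stays self-contained:

import numpy as np
from multiprocessing import Pool

def closest_indices(args):
    # args is a (sources_chunk, destinations) tuple so the worker can be pickled
    chunk, destinations = args
    out = []
    for point in chunk:
        dlt = destinations - point           # delta between destinations and given point
        d = np.einsum('ij,ij->i', dlt, dlt)  # squared distances (equivalent to inner1d)
        out.append(np.argmin(d))             # index of the closest destination
    return out

if __name__ == '__main__':
    destinations = np.random.rand(10000, 3) * 100
    sources = np.random.rand(10000, 3) * 100

    n_procs = 8  # illustrative; multiprocessing.cpu_count() is the usual choice
    chunks = np.array_split(sources, n_procs)

    pool = Pool(n_procs)
    results = sum(pool.map(closest_indices, [(c, destinations) for c in chunks]), [])
    pool.close()
    pool.join()

Each process pays the cost of pickling `destinations` once, so this only pays off when the arrays are reasonably large.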

On a side note, if you're using outside modules, be mindful of matching Maya's Python version. This may be the reason you couldn't build scipy. Of course, scipy has a huge codebase, and Windows is not the friendliest platform to build it on.

Felipe Lema
  • Yeah, it's very annoying. Because it seems there's no way to improve performance by staying purely in numpy, I may end up pickling the data out and having an external "standard" python instance running (with scipy and all its arsenal) to crunch it down and spit out the result. Not what I want, but at this time I need to move on. – Fnord Aug 06 '15 at 22:01
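A rough sketch of what that pickle-and-shell-out workaround could look like; the interpreter path and the helper script name are purely hypothetical:

import pickle
import subprocess

def query_external(sources, destinations,
                   python_exe='C:/Python27/python.exe',  # hypothetical system Python that has scipy
                   script='closest_points.py'):          # hypothetical helper script doing the nearest-neighbour query
    # dump the inputs, run the external interpreter, then read the result back
    with open('job.pkl', 'wb') as f:
        pickle.dump((sources, destinations), f, protocol=2)
    subprocess.check_call([python_exe, script, 'job.pkl', 'result.pkl'])
    with open('result.pkl', 'rb') as f:
        return pickle.load(f)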