
After having trouble with MATLAB, I decided to try Python:

I wrote a function that calculates kNN when the samples are instances of my own class, using my own distance function:

def closestK(sample, otherSamples, distFunc, k):
"Returns the closest k samples to sample based on distFunc"
    n = len(otherSamples)
    d = [distFunc(sample, otherSamples[i]) for i in range(0,n)]
    idx  = sorted(range(0,len(d)), key=lambda k: d[k])
    return idx[1:(k+1)]

def kNN(samples, distFunc, k):
    return [[closestK(samples[i], samples, distFunc, k)] for i in range(len(samples))]

and this is the distance function:

@staticmethod    
def distanceRepr(c1, c2):
    r1 = c1.repr
    r2 = c2.repr
    # because cdist needs 2D array
    if r1.ndim == 1:
        r1 = np.vstack([r1,r1])
    if r2.ndim == 1:
        r2 = np.vstack([r2,r2])

    return scipy.spatial.distance.cdist(r1, r2, 'euclidean').min()

But it still runs dramatically slower than the "normal" kNN function, even when using the "brute" algorithm. Am I doing something wrong?

UPDATE

I'm adding the constructor of the class. The attribute repr contains a set of vectors (anywhere from 1 upwards), and the distance between two clusters is the minimal Euclidean distance between their sets of repr vectors.

class myCluster:
    def __init__(self, index = -1, P = np.array([])):
        if index == -1:
            self.repr = np.array([])
            self.IDs = np.array([])
            self.n = 0
            self.center = np.array([])
        else:
            self.repr = np.array(P)
            self.IDs = np.array(index)
            self.n = 1
            self.center = np.array(P)

and the rest of the relevant code (X is a matrix whose rows are samples and whose columns are variables):

level = [myCluster(i, X[i,:]) for i in range(0,n)]
kNN(level, myCluster.distanceRepr, 3)

UPDATE 2

I've made some measurements and the line that takes most of the time is

d = [distFunc(sample, otherSamples[i]) for i in range(0,n)]

So something about distFunc is the problem. When I change it to return

np.linalg.norm(c1.repr-c2.repr)

i.e. "normal" vector calculation, with no sorting, the running time stays the same. So the problem lies in the calling of this function. Does it make sense that the use of classes changes the running time by a factor of 60?

Roy
  • What is this "normal" kNN function you are referring to precisely? –  Oct 16 '14 at 09:03
  • @moarningsun: The function NearestNeighbors in scikit. In the setup that I tested they are doing exactly the same thing: calculating kNN for n Euclidean vectors, only that my vectors are "wrapped" by a class. – Roy Oct 16 '14 at 09:07
  • The solution is to use NumPy functions instead of list comprehensions such as `d = [dist...` and `idx = sorted(...`. Converting to NumPy depends on what `distFunc` is. If you provide the full source, you can get a better answer. – emesday Oct 16 '14 at 09:39
  • @mskimm: I've added the relevant code. thanks. – Roy Oct 16 '14 at 09:41
  • 1
    Just curious, when you are using the `sklearn.NearestNeighbors` class, are you also using the `algorithm='ball_tree'` or `algorithm='kd_tree'` options? These will speed up the algorithm quite a bit, but only if you do repeated queries. I'm also not sure whether this is the point of your question, so wanted to make this a comment first. – lmjohns3 Oct 17 '14 at 02:25
  • @lmjohns3: I use the option 'brute' to simulate my own algorithm – Roy Oct 17 '14 at 08:31
  • The factor of 60 is between what and what? `cdist(r1, r2, 'euclidean').min()` seems standard but it's far from optimal in practice for a large number of points; can you tell us a bit more about the size of your sets? – Ara Oct 17 '14 at 20:22
  • @Ara: Between the running time of my kNN function and sklearn's function. I tested with n=2000 and p=2, but I intend to use larger data sets, even n=10000. What would you suggest instead of cdist? – Roy Oct 18 '14 at 07:03
  • I was thinking of using some algorithm based on spatial division, for instance something along the lines of http://en.wikipedia.org/wiki/Closest_pair_of_points_problem . I'll look into it a bit (I'm not a geometry specialist and I don't know any particular algorithm for this problem). – Ara Oct 20 '14 at 11:49
  • This paper: http://www-cgrl.cs.mcgill.ca/~godfried/publications/mindist.pdf (don't ask me why it is reversed ...), describes an O(n log(n)) algorithm. – Ara Oct 20 '14 at 14:18
  • I added a method based on space partitioning to compute the distance in my answer, and I don't think you should give up so fast on Python. Still, we can't tell you anything very relevant about what is slow without any profiling, and it would be the same in Java or another language. – Ara Oct 21 '14 at 18:00

2 Answers


You're just running into the slowness of Python (or rather of the CPython interpreter, I should say). From Wikipedia:

NumPy targets the CPython reference implementation of Python, which is a non-optimizing bytecode compiler/interpreter. Mathematical algorithms written for this version of Python often run much slower than compiled equivalents. NumPy seeks to address this problem by providing multidimensional arrays and functions and operators that operate efficiently on arrays. Thus any algorithm that can be expressed primarily as operations on arrays and matrices can run almost as quickly as the equivalent C code.

And from the Scipy FAQ:

Python’s lists are efficient general-purpose containers. They support (fairly) efficient insertion, deletion, appending, and concatenation, and Python’s list comprehensions make them easy to construct and manipulate. However, they have certain limitations: they don’t support “vectorized” operations like elementwise addition and multiplication, and the fact that they can contain objects of differing types mean that Python must store type information for every element, and must execute type dispatching code when operating on each element. This also means that very few list operations can be carried out by efficient C loops – each iteration would require type checks and other Python API bookkeeping.

Note this doesn't concern only Python; for more background see e.g. this and this question on SO.

Due to the overhead of the dynamic type system and the interpreter, Python would be a lot less useful for high-performance number crunching if it couldn't tap into all sorts of compiled C and Fortran libraries (e.g. NumPy). There are also JIT compilers like Numba and PyPy that try to get Python code to execute closer to the speed of statically typed, compiled code.

Bottom line: you're doing too much in plain Python relative to the work that you're offloading to fast C code. I suppose you need to adopt a more "array oriented" coding style rather than an object-oriented one to achieve good performance with NumPy (MATLAB is a very similar story in this regard). On the other hand, if you used a more efficient algorithm (see the answer by Ara), the slowness of Python might not be such an issue.
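
To make "array oriented" concrete, here is a rough sketch (mine, not the OP's code) of the same brute-force kNN expressed as whole-array operations, assuming every repr is a single vector, so the n-by-n distance matrix is built by one cdist call instead of n*n Python-level calls:

import numpy as np
from scipy.spatial.distance import cdist

def knn_vectorized(X, k):
    # X: (n, p) array of samples, one row per sample
    D = cdist(X, X, 'euclidean')             # full (n, n) distance matrix in one C-level call
    # column 0 of each argsorted row is the sample itself (distance 0), so keep columns 1..k
    return np.argsort(D, axis=1)[:, 1:k + 1]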

  • Thanks. Would Java perform better for that matter? – Roy Oct 20 '14 at 14:05
  • I don't believe there is much actual computation done by Python here. Can the OP give us some profiling information? For instance by running `python -m cProfile -s time YOUR_PROGRAM.py` and posting the result on pastebin? – Ara Oct 20 '14 at 14:28
  • @Roy, I'm not at all familiar with Java. –  Oct 20 '14 at 14:41
  • @Ara, exactly. Almost all the time is spent on overhead: reference counting, function calls, all kinds of lookups, etc. All of this takes up to 60 times (timing by OP) as long as doing the actual calculation at C speed! For more demonstration of this difference see maybe [here](http://continuum.io/blog/numba_performance) or [here.](http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/NumbaCython.ipynb) –  Oct 20 '14 at 14:52
  • I don't necessarily agree that most time is spent in overhead here; it could also be in NumPy: we are creating a 2000*2000 matrix in NumPy just to find its minimum! Anyway, some profiling information from @Roy would be great. – Ara Oct 21 '14 at 14:34
  • @Ara, not exactly. The OP is calling `cdist` 2000*2000 times, creating a 2*2 array each time. This way the calculation per function call is tiny compared to everything else that's going on (the function calls, the loops, the class attribute lookups, the lists...). At least that's what the code says; in the description he does mention sets of vectors... –  Oct 21 '14 at 21:54
  • My understanding was that each sample was made of 2000 elements of dimension 2, but indeed it was `n = len(otherSamples)` and not `n = len(sample)`, so I completely misjudged the problem... You must be right: the overhead of allocating memory for a NumPy matrix 4000000 times should completely kill the process. – Ara Oct 21 '14 at 22:15

Here are the points I can think of:

  • you are computing the distance between one sample and every other sample every time you call closestK, so you compute the distance between every pair twice (once as distance(a,b), then as distance(b,a)); this can be avoided by computing it once and for all
  • you recompute r (which might involve a costly vstack) 2 * (n - 1) times, where n is len(sample); you could also compute it once and for all (and store it as an attribute of myCluster?)
  • you sort the full list of distances while you only need the top k (no need to sort beyond the k-th element); see the argpartition sketch after this list
  • to compute the minimum distance between the points of your sets, you create a matrix containing every pairwise distance and then take its min: you can certainly do better
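
To illustrate the third point (my addition, not part of the original answer): NumPy's argpartition selects the k smallest distances without sorting the whole list, and could replace the full sorted(...) call in closestK:

import numpy as np

def closest_k_indices(d, k):
    # d: distances from one sample to all samples (itself included)
    d = np.asarray(d)                      # works even if d is a plain Python list
    idx = np.argpartition(d, k)[:k + 1]    # unordered indices of the k+1 smallest distances, O(n)
    idx = idx[np.argsort(d[idx])]          # sort only those k+1 candidates
    return idx[1:k + 1]                    # drop the sample itself, like idx[1:(k+1)] in closestK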

My suggestion would be to implement a top-k class with an insert method that inserts a new element only when it is better than the current k-th element (which is then removed), and to modify myCluster to include r. Then your code might look like:

kNN = {i: TopK(k) for i in xrange(len(samples))}   # one top-k heap per sample
for i, sample1 in enumerate(samples):
    for j, sample2 in enumerate(samples[:i]):      # each unordered pair is visited only once
        dist = distanceRepr(sample1, sample2)
        kNN[i].insert(j, -dist)                    # scores are negated distances, see TopK below
        kNN[j].insert(i, -dist)
return kNN

Here is a possible implementation of TopK:

import heapq

class TopK:
    def __init__(self, k):
        self.k = k
        self.content = []   # min-heap of (score, key) pairs, at most k of them

    def insert(self, key, score):
        if len(self.content) < self.k:
            heapq.heappush(self.content, (score, key))
        else:
            # push the new pair and evict the pair with the smallest score
            heapq.heappushpop(self.content, (score, key))

    def get_keys(self):
        return [elem[1] for elem in self.content]
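
A quick usage sketch of TopK (my addition); the scores are negated distances, as in the loop above, so the min-heap evicts the largest distance first:

top = TopK(3)
for key, dist in [('a', 2.5), ('b', 1.0), ('c', 4.2), ('d', 0.3)]:
    top.insert(key, -dist)
print(top.get_keys())   # -> ['a', 'b', 'd']: the keys of the 3 smallest distances (heap order, not sorted)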

For distanceRepr you can use something like:

import scipy.spatial

def distanceRepr(set0, set1):
    # query the points of the smaller set against a KD-tree built on the larger one
    if len(set0) < len(set1):
        min_set = set0
        max_set = set1
    else:
        min_set = set1
        max_set = set0
    if len(min_set) == 0:
        raise Exception("Empty set")

    min_dist = scipy.inf
    tree = scipy.spatial.cKDTree(max_set)

    for point in min_set:
        # nearest neighbour of point: k=1, eps=0, p=2 (Euclidean),
        # pruned by the best distance found so far
        distance, _ = tree.query(point, 1, 0., 2, min_dist)
        if min_dist > distance:
            min_dist = distance

    return min_dist

It will be faster than your current method for medium and big inputs (let's say sample1 and sample2 with more than 5k points), and it will also have a much smaller memory footprint, allowing it to work with big samples (where cdist simply runs out of memory).
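
A quick usage sketch of this distanceRepr (mine; the random point sets are just placeholders):

import numpy as np

set0 = np.random.rand(10, 2)      # small set: queried point by point
set1 = np.random.rand(5000, 2)    # large set: the cKDTree is built on this one
print(distanceRepr(set0, set1))   # minimal Euclidean distance between the two sets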

Ara
  • Thanks. About your first point: I know, but the other option is to store all the distances in memory, and I prefer not to. About the second, please see my second update. About the third, even when I use scikit with k=n it takes almost the same time as with a small k. It seems there's something more fundamental here. – Roy Oct 16 '14 at 12:00
  • As you can see here, I don't store more in memory than needed and I avoid the recomputation. – Ara Oct 16 '14 at 12:02