
I'm trying to run a K-Means clustering algorithm with numpy and Python, but I keep running into a memory error for larger values of K (anything greater than 10 seems to trigger it). I have two numpy arrays of size [42000, 784] (the dataset) and [K, 784] (the centroids). The memory error occurs when calculating the Euclidean distance between each of the centroids and each of the data points. This is the function I have been using:

import numpy as np

def dist(a, b):
    # broadcast a to (1, N, 784) and b to (K, 1, 784)
    a = a[np.newaxis, :, :]
    b = b[:, np.newaxis, :]
    # a - b has shape (K, N, 784); the norm over the last axis gives (K, N)
    dist = np.linalg.norm((a - b), axis=2)
    return dist
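
For reference, the broadcast difference `a - b` inside this function has shape (K, 42000, 784), so its memory footprint grows linearly with K; a quick back-of-the-envelope check (assuming float64, i.e. 8 bytes per value, as discussed in the comments below):

K = 15
# the intermediate (a - b) array holds K * 42000 * 784 float64 values
n_bytes = K * 42000 * 784 * 8
print(n_bytes / 1e9)    # ~3.95 GB (about 3.7 GiB) for K = 15 alone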

Is this a memory leak or do I legitimately not have enough memory (I have 8 GB)? How can I fix this?

LunarLlama
  • Note, your reshaping operations can be more succinctly articulated as `a[None,:,:]` and `b[:,None,:]`, or less arcanely, you can use `np.newaxis` like so: `b[:,np.newaxis,:]` – juanpa.arrivillaga Jan 14 '18 at 21:57
  • So the final array `a-b` will contain `42000 * 784 * K` elements. Supposing you use `float64`s (8 bytes per value), that will be `251 * K` MB. For an imaginary K = 15 that's ~3.7 GB. `np.linalg.norm` may just use a bit of extra memory but (hopefully) not that much. The really interesting question now is: how much RAM do you have? – MSeifert Jan 14 '18 at 21:58
  • I have 8 GB of RAM, so it should work. I'd like to be able to scale it up to something along the lines of K=256; is there any way to do this? Closing everything else that's running gets K=20 to work, but it takes a very long time. Checking RAM usage shows around 7.5 GB used. – LunarLlama Jan 14 '18 at 22:27
  • @juanpa.arrivillaga I fixed my reshaping, does that reduce memory usage? – LunarLlama Jan 14 '18 at 22:39
  • @LunarLlama no, it was an aside. – juanpa.arrivillaga Jan 14 '18 at 22:49

2 Answers


scipy has built-in functions for distance computations, which are lightning fast compared to homemade implementations.

So, the first idea is to replace your whole distance function by the following expression:

from numpy.random import randn
from scipy.spatial import distance

# sample data
a = randn(42000, 784)
b = randn(256, 784)

# distance computation
dist = distance.cdist(a, b, metric='euclidean')    # about 8.02 s on
                                                   # my 8 GB RAM machine

Note that `dist` in this example is transposed relative to the result of your `dist` function. If you want the shape from your example, just do `dist = distance.cdist(a, b).T`.

It is also possible to speed up the computation a little by omitting the square root operation. You can accomplish this with `dist = distance.cdist(a, b, metric='sqeuclidean')`.
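
Since the square root is monotonic, the squared distances give the same nearest-centroid assignment. Here is a minimal sketch of that assignment step, using random placeholder data with the question's shapes:

import numpy as np
from scipy.spatial import distance

data = np.random.rand(42000, 784)       # stand-in for the dataset
centroids = np.random.rand(256, 784)    # stand-in for K = 256 centroids

# squared Euclidean distances, shape (42000, 256)
d2 = distance.cdist(data, centroids, metric='sqeuclidean')

# argmin is unaffected by the monotonic square root
labels = d2.argmin(axis=1)              # index of the nearest centroid per point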

This whole approach does not greatly reduce memory consumption, but the memory is only needed for a few seconds.

The second idea is to not use homemade implementations at all, but a reliable third-party package, like the well-known scikit-learn:

from sklearn.cluster import KMeans
from numpy.random import randn

a = randn(4200, 200)

km = KMeans(n_clusters=256)
km.fit(a)                    # about 10 s

One of several advantages of this implementation is that it automatically decides how to compute the distances so that it does not blow up your memory.
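
After fitting, the labels and centroids can be read straight off the estimator; a brief usage sketch, reusing `km` and `a` from above:

labels = km.labels_               # cluster index for each row of a
centers = km.cluster_centers_     # array of shape (n_clusters, n_features)
new_labels = km.predict(a)        # assign points to their nearest centroid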

MaxPowers

As an alternative, using the numba accelerator in nopython parallel mode can be one of the fastest methods. I have compared the performance of numba and cdist on various array sizes; both take about the same time (e.g. both take around 8 seconds on my machine), though numba may beat cdist on smaller arrays:

import numba as nb
import numpy as np

@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def distances_numba(a, b):
    # pairwise Euclidean distances, shape (a.shape[0], b.shape[0])
    arr = np.zeros((a.shape[0], b.shape[0]), dtype=np.float64)
    temp_arr = np.zeros_like(arr)

    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(a.shape[1]):
                temp_arr[i, j] += (a[i, k] - b[j, k]) ** 2
            arr[i, j] = temp_arr[i, j] ** 0.5

    return arr
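
A quick way to try it with the question's shapes; the inputs must be C-contiguous float64 arrays to match the signature above, which `np.random.rand` already produces:

a = np.random.rand(42000, 784)      # stand-in for the dataset
b = np.random.rand(256, 784)        # stand-in for K = 256 centroids
d = distances_numba(a, b)           # pairwise distances, shape (42000, 256)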

I did not compare memory consumption, but I think numba will be one of the best in this regard.

Update

We can parallelize max9111's answer, which was 10-20 times faster than cdist or my first solution. The parallelization makes max9111's solution around 1.5 to 2 times faster. The benchmarks are based on some of my tests and need more evaluation.

@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def first(A):
    TMP_A = np.zeros(A.shape[0], dtype=np.float64)
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            TMP_A[i] += A[i, j] ** 2
    return TMP_A


@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def second(B):
    TMP_B = np.zeros(B.shape[0], dtype=np.float64)
    for i in nb.prange(B.shape[0]):
        for j in range(B.shape[1]):
            TMP_B[i] += B[i, j] ** 2
    return TMP_B


@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def calc_dist_p(A, B):
    dist = np.dot(A, B.T)
    TMP_A = first(A)
    TMP_B = second(B)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            dist[i, j] = (-2. * dist[i, j] + TMP_A[i] + TMP_B[j]) ** 0.5
    return dist
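
As a quick sanity check, the parallelized version can be compared against the loop implementation above (same stand-in arrays as in the earlier sketch):

a = np.random.rand(42000, 784)
b = np.random.rand(256, 784)

d1 = distances_numba(a, b)
d2 = calc_dist_p(a, b)
print(np.abs(d1 - d2).max())        # small; the -2*a.b trick is slightly less precise
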
Ali_Sh
    You can beat cdist by more than a factor of 50 if you change the algorithm by using the following relation (a-b)^2 = a^2 + b^2 - 2ab (less precise): https://stackoverflow.com/a/53380192/4045774 The idea is to do most of the calculation in a highly optimized BLAS routine. But this is only the case if the size a.shape[1] is relatively large (10-20 and upwards) – max9111 Jun 10 '22 at 08:45
  • @max9111, your answer was much better than my first solution and `cdist` (+1). Based on your solution, I proposed a second solution by parallelizing your code. Now it is faster, and it may get a little faster with some tricks, but not by much. – Ali_Sh Jun 10 '22 at 13:10