
I'm working with a dataset of 6.4 million samples, each with 500 dimensions, and I'm trying to group it into 200 clusters. I'm limited to 90 GB of RAM, and when I try to run MiniBatchKMeans from sklearn.cluster, the OS kills the process for using too much memory.

This is the code:

import numpy as np
from sklearn import cluster

numClusters = 200

# both loadtxt and fit() need the full 6.4M x 500 array in memory at once
data = np.loadtxt('temp/data.csv', delimiter=',')
labels = np.genfromtxt('temp/labels', delimiter=',')

kmeans = cluster.MiniBatchKMeans(n_clusters=numClusters, random_state=0).fit(data)
predict = kmeans.predict(data)
Tdata = kmeans.transform(data)

It doesn't get past clustering.

2 Answers


The solution is to use scikit-learn's partial_fit method. Not all estimators have it, but MiniBatchKMeans does.

So you can train "partially", but you'll have to split your data and not reading it all in one go, this is can be done with generators, there is many ways to do it, if you use pandas for example, you can use this.

Then, instead of using fit, you should use partial_fit to train.
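Here is a minimal sketch of how that could look. It assumes your CSV has no header row and that a chunk of 100,000 rows fits comfortably in memory; adjust both to your file.

import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=200, random_state=0)

# stream the CSV in chunks so the full 6.4M x 500 array never sits in RAM at once
for chunk in pd.read_csv('temp/data.csv', header=None, chunksize=100_000):
    kmeans.partial_fit(chunk.to_numpy(dtype=np.float32))

# predict (and transform) can also be done chunk by chunk, then concatenated
predict = np.concatenate([
    kmeans.predict(chunk.to_numpy(dtype=np.float32))
    for chunk in pd.read_csv('temp/data.csv', header=None, chunksize=100_000)
])

Each 100,000 x 500 float32 chunk is only about 200 MB, so memory stays flat no matter how large the file is.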

Or Duan

I think you can also try decreasing the precision of your data to reduce the amount of allocated memory. Use float32 rather than the default float64: 6.4 million samples x 500 dimensions in float64 is already about 25.6 GB, and float32 halves that.
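For example, a small sketch using the same loading call as in the question:

import numpy as np

# load directly as float32 (~12.8 GB for 6.4M x 500) instead of float64 (~25.6 GB)
data = np.loadtxt('temp/data.csv', delimiter=',', dtype=np.float32)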

gasoon