
I'm working with a dataset of 6.4 million samples, each with 500 dimensions, and I'm trying to group it into 200 clusters. I'm limited to 90 GB of RAM, and when I try to run MiniBatchKMeans from sklearn.cluster, the OS kills the process for using too much memory.

This is the code:

import numpy as np
from sklearn import cluster

numClusters = 200

# both loadtxt and fit() need the full 6.4M x 500 array in memory at once
data = np.loadtxt('temp/data.csv', delimiter=',')
labels = np.genfromtxt('temp/labels', delimiter=',')

kmeans = cluster.MiniBatchKMeans(n_clusters=numClusters, random_state=0).fit(data)
predict = kmeans.predict(data)
Tdata = kmeans.transform(data)

It doesn't get past clustering.

2 Answers


The solution is to use scikit-learn's partial_fit method. Not all estimators have it, but MiniBatchKMeans does.

So you can train "partially", but you'll have to split your data and not reading it all in one go, this is can be done with generators, there is many ways to do it, if you use pandas for example, you can use this.

Then, instead of using fit, you should use partial_fit to train.
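Here is a minimal sketch of how that could look. It assumes your CSV has no header row and that a chunk of 100,000 rows fits comfortably in memory; adjust both to your file.

import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=200, random_state=0)

# stream the CSV in chunks so the full 6.4M x 500 array never sits in RAM at once
for chunk in pd.read_csv('temp/data.csv', header=None, chunksize=100_000):
    kmeans.partial_fit(chunk.to_numpy(dtype=np.float32))

# predict (and transform) can also be done chunk by chunk, then concatenated
predict = np.concatenate([
    kmeans.predict(chunk.to_numpy(dtype=np.float32))
    for chunk in pd.read_csv('temp/data.csv', header=None, chunksize=100_000)
])

Each 100,000 x 500 float32 chunk is only about 200 MB, so memory stays flat no matter how large the file is.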

Or Duan

I think you can also try decreasing the precision of your data to reduce the amount of allocated memory. Use float32 rather than the default float64: 6.4 million samples x 500 dimensions in float64 is already about 25.6 GB, and float32 halves that.
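For example, a small sketch using the same loading call as in the question:

import numpy as np

# load directly as float32 (~12.8 GB for 6.4M x 500) instead of float64 (~25.6 GB)
data = np.loadtxt('temp/data.csv', delimiter=',', dtype=np.float32)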

gasoon