
I am using the scipy.cluster.hierarchy.fclusterdata function to cluster a list of vectors (each vector has 384 components).

It works nicely, but when I try to cluster a large amount of data I run out of memory and the program crashes.

How can I perform the same task without running out of memory?

My machine has 32GB RAM, Windows 10 x64, Python 3.6 (64-bit).
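Roughly what I'm doing (a simplified sketch with random placeholder data; the real input is a much larger list of vectors, which is where it crashes):

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Placeholder data: the real input has far more rows than this.
vectors = np.random.rand(200, 384)

# One flat cluster label per vector; t and criterion are illustrative settings.
labels = fclusterdata(vectors, t=0.5, criterion='distance', metric='euclidean')
```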

jtlz2

2 Answers


You'll need to choose a different algorithm.

Hierarchical clustering needs O(n²) memory, and the textbook algorithm takes O(n³) time, so it cannot scale to large data sets.
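For example, a density-based method such as scikit-learn's DBSCAN does not need the number of clusters up front and, with a tree-based neighbour search, avoids materialising the full pairwise-distance matrix (a sketch; `eps` and `min_samples` are illustrative and must be tuned for your data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

vectors = np.random.rand(500, 384)  # stand-in for the real data

# labels[i] is the cluster id of vectors[i]; -1 marks noise points.
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(vectors)
```

Note that for 384-dimensional vectors you may want to reduce dimensionality first (e.g. with PCA), since neighbour searches degrade in high dimensions.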

Has QUIT--Anony-Mousse
  • What would you suggest? I just want an algorithm that creates clusters from a list of vectors; I don't want to specify the number of clusters to be formed. – Samuel Ferreira Oct 16 '19 at 09:16

You could have a look at

However, you will have to set up a pipeline to test different numbers of clusters. It's hard to say which algorithm will work best for you, though.
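One way to sketch such a pipeline (assuming scikit-learn is available; MiniBatchKMeans is memory-friendly, and the silhouette score picks among candidate cluster counts):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

vectors = np.random.rand(1000, 384)  # stand-in for the real data

best_k, best_score = None, -1.0
for k in range(2, 6):  # candidate cluster counts to test
    labels = MiniBatchKMeans(n_clusters=k, n_init=3,
                             random_state=0).fit_predict(vectors)
    # Subsample the silhouette computation to keep memory and time bounded.
    score = silhouette_score(vectors, labels, sample_size=500, random_state=0)
    if score > best_score:
        best_k, best_score = k, score
```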

Gregor