How to cluster big data using Python or R without memory error?

Question

I am trying to cluster a data set with about 1,100,000 observations, each with three values.

The code is pretty simple in R:

df11.dist <-dist(df11cl) , where df11cl is a dataframe with three columns and 1,100,000 rows and all the values in this data frame are standardized.

the error I get is : Error: cannot allocate vector of size 4439.0 Gb

Recommendations on similar problems include increasing RAM or chunking data. I already have 64GB RAM and my virtual memory is 171GB, so I don't think increasing RAM is a feasible solution. Also, as far as I know, chunked data in hierarchical data analysis yields different results. So, it seems using a sample of data is out of question.

I have also found this solution, but the answers actually alter the question. They technically advise k-means. K-means could work if one knows the number of clusters beforehand. I do not know the number of clusters. That said, I ran k-means using different number of clusters, but now I don't know how to justify the selection of one to another. Is there any test that can help?

Can you recommend anything in either R or python?

Please add your full error traceback and the code related to the problem. — Klaus D., Oct 26 '19 at 01:47

score 4 · Accepted Answer · answered Oct 27 '19 at 09:59

For trivial reasons, the function dist needs quadratic memory.

So if you have 1 million (10^6) points, a quadratic matrix needs 10^12 entries. With double precision, you need 8 bytes for each entry. With symmetry, you only need to store half of the entries, still that is 4*10^12 bytea., I.e. 4 Terabyte just to store this matrix. Even if you would store this on SSD or upgrade your system to 4 TB of RAM, computing all these distances would take an insane amount of time.

And 1 million is still pretty small, isn't it?

Using dist on big data is impossible. End of story.

For larger data sets, you'll need to

use methods such as k-means that do not use pairwise distances
use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
subsample your data to make it smaller

In particular that last thing is a good idea if you don't have a working solution yet. There is no use in struggling with scalability of a method that does not work.

How to cluster big data using Python or R without memory error?

1 Answers1