I am trying to cluster a data set with about 1,100,000 observations, each with three values.
The code is pretty simple in R:

df11.dist <- dist(df11cl)

where df11cl is a data frame with three columns and 1,100,000 rows, and all of its values are standardized.
The error I get is:
Error: cannot allocate vector of size 4439.0 Gb
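That size is no surprise: dist stores the lower triangle of the pairwise distance matrix, i.e. n(n-1)/2 doubles at 8 bytes each. A quick back-of-envelope check in R (my n is approximate):

n <- 1100000
n * (n - 1) / 2 * 8 / 1024^3  # ~4500 GiB for the dist object

which is in line with the 4439.0 Gb in the error message.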
Recommendations for similar problems include increasing RAM or chunking the data. I already have 64 GB of RAM and 171 GB of virtual memory, so I don't think adding more RAM is a feasible solution. Also, as far as I know, chunking the data changes the result of hierarchical clustering, so working on a sample of the data seems out of the question.
I have also found this solution, but the answers effectively change the question: they advise k-means instead. K-means could work if one knows the number of clusters beforehand, and I do not. That said, I ran k-means with different numbers of clusters (roughly the loop sketched below), but now I don't know how to justify choosing one k over another. Is there any test that can help?
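For reference, this is roughly what I ran (df11cl as above; the range of k and the nstart value are arbitrary choices on my part), recording the total within-cluster sum of squares for each k:

ks <- 2:15
wss <- sapply(ks, function(k)
  kmeans(df11cl, centers = k, nstart = 10, iter.max = 50)$tot.withinss)
plot(ks, wss, type = "b", xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")

The curve decreases with every added cluster, which is exactly why I can't justify one k over another from this alone.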
Can you recommend anything in either R or Python?