k-means clustering in R on very large, sparse matrix?

Question

I am trying to do some k-means clustering on a very large matrix.

The matrix is approximately 500,000 rows x 4,000 columns but very sparse (only a couple of "1" values per row).

The whole thing does not fit into memory, so I converted it into a sparse ARFF file. But R obviously can't read the sparse ARFF file format. I also have the data as a plain CSV file.

Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.

Many thanks

Thanks for the answer! I have another question, though :-) I am trying to run bigkmeans with about 2000 clusters, e.g. "clust <- bigkmeans(mymatrix, centers=2000)". However, I get the following error: "Error in 1:(10 + 2^k) : result would be too long a vector". Can someone give me a hint about what I am doing wrong here? Thanks! — movingabout, Jun 18 '10 at 07:49
Original at http://stackoverflow.com/questions/3177827/clustering-on-very-large-sparse-matrix — Andrew Dalke, Dec 20 '11 at 20:04
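The arithmetic behind the error quoted in movingabout's comment can be reproduced in plain R. This is only a sketch of why the expression fails when k = 2000. Where bigkmeans evaluates it is not shown in the comment, and k is assumed here to be the requested number of centers.

```r
k <- 2000

# 2^2000 is larger than the biggest representable double, so R returns Inf.
overflowed <- is.infinite(2^k)

# 1:(10 + Inf) cannot be materialised as a sequence, hence the error.
msg <- tryCatch(
  {
    1:(10 + 2^k)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(overflowed)
print(msg)
```

Any expression that grows as 2^k becomes unusable long before k reaches 2000, so the failure appears to come from the size of k rather than from the data.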

score 14 · Answer 1 · answered Jun 14 '10 at 18:10

The bigmemory package (now a family of packages; see their website) uses k-means as a running example of extended analytics on large data. See in particular the sub-package biganalytics, which contains the k-means function bigkmeans.
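A minimal sketch of the workflow this answer points to, assuming the bigmemory and biganalytics packages are installed from CRAN. The toy data and variable names are made up for illustration. Note that a big.matrix stores its values densely, so the sparsity of the original data is not exploited here.

```r
fit <- NULL

# Run only if both packages are available (assumption: installed from CRAN).
if (requireNamespace("bigmemory", quietly = TRUE) &&
    requireNamespace("biganalytics", quietly = TRUE)) {
  set.seed(1)

  # Toy stand-in for the real data: two well-separated groups of 100 rows.
  x <- rbind(
    matrix(rnorm(100 * 2, mean = 0), ncol = 2),
    matrix(rnorm(100 * 2, mean = 10), ncol = 2)
  )

  # Convert to a big.matrix. For data on disk, bigmemory::read.big.matrix()
  # with a backing file would be the usual route instead.
  big_x <- bigmemory::as.big.matrix(x, type = "double")

  # k-means on the big.matrix; centers = 2 matches the two toy groups.
  fit <- biganalytics::bigkmeans(big_x, centers = 2, iter.max = 20)
  print(table(fit$cluster))
}
```

For the matrix in the question, a smaller storage type (for example type = "char" for 0/1 values) together with a file-backed matrix would probably be needed to keep the memory footprint manageable.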

Dirk Eddelbuettel


+1 for bigmemory, I had no idea that they had so many packages. – richiemorrisroe Jun 03 '11 at 20:34
Yes, and the read.big.matrix() function from the bigmemory package supports only one atomic data type per matrix. – Scott Davis Jun 13 '14 at 16:21

score 1 · Answer 2 · answered May 20 '15 at 10:50

sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should work for matrices that R can handle, i.e. ones that fit into memory.

http://cran.r-project.org/web/packages/sparcl/sparcl.pdf

==

For really big matrices, I would try Apache Spark's sparse matrices together with MLlib, though I do not know how experimental that support is at the moment:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$

https://spark.apache.org/docs/latest/mllib-clustering.html

score 1 · Answer 3 · edited Dec 01 '11 at 00:45

Please check:

library(foreign)
?read.arff

Cheers.

edited Dec 01 '11 at 00:45

joran


answered Jun 03 '11 at 16:03

Freddy López


score 0 · Answer 4 · answered May 31 '13 at 23:21

There's a dedicated SparseM package for R that can hold such a matrix efficiently. If that doesn't work, I would try moving to a higher-performance language, like C.
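To illustrate what holding the data sparsely means, here is a small sketch. It uses the Matrix package, which ships with standard R installations, rather than SparseM, so it is a swapped-in illustration and not the SparseM API. The row and column indices are hypothetical; in practice they would be parsed from the CSV or sparse ARFF file.

```r
library(Matrix)  # recommended package bundled with standard R installs

# Hypothetical triplet data: one (row, col) pair per "1" in the matrix.
rows <- c(1, 1, 2, 3, 3, 4)
cols <- c(2, 5, 1, 3, 5, 4)

# Only the six nonzero entries are stored, not the full 4 x 5 grid.
m <- sparseMatrix(i = rows, j = cols, x = 1, dims = c(4, 5))

print(nnzero(m))    # number of stored nonzero values
print(rowSums(m))   # count of "1" values per row
```

A matrix like the one in the question, with only a couple of nonzero values in each of its 500,000 rows, would need on the order of a million stored entries in this form. The clustering routine still has to accept sparse input, though; stats::kmeans() converts its argument to a dense matrix.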

Olga Mu

