9

I am trying to do some k-means clustering on a very large matrix.

The matrix is approximately 500000 rows x 4000 cols yet very sparse (only a couple of "1" values per row).

The whole thing does not fit into memory, so I converted it into a sparse ARFF file. However, R apparently cannot read the sparse ARFF file format. I also have the data as a plain CSV file.

Is there any package available in R for loading such sparse matrices efficiently? I'd then use the regular k-means algorithm from the cluster package to proceed.

Many thanks

movingabout
  • Thanks for the answer! I have another question, though :-) I am trying to run bigkmeans with about 2000 clusters, e.g. "clust <- bigkmeans(mymatrix, centers=2000)". However, I get the following error: "Error in 1:(10 + 2^k) : result would be too long a vector". Can someone give me a hint as to what I am doing wrong here? Thanks! – movingabout Jun 18 '10 at 07:49
  • 1
    Original at http://stackoverflow.com/questions/3177827/clustering-on-very-large-sparse-matrix – Andrew Dalke Dec 20 '11 at 20:04
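
The arithmetic in the error message quoted in the comment above can be checked in plain R. This is only a sketch of the expression shown in the message, assuming that k there is the requested number of centers (2000); it does not call bigkmeans itself.

```r
# Assumption: k is the number of centers from the comment (2000).
k <- 2000

# 2^2000 is larger than the largest representable double, so R yields Inf.
upper <- 10 + 2^k
is.infinite(upper)

# A sequence cannot run up to Inf, so `:` raises an error; capture its message.
msg <- tryCatch(1:upper, error = function(e) conditionMessage(e))
msg

# For comparison, a small k gives a finite, modest bound.
10 + 2^5
```

If that assumption holds, the failure comes from the size of k itself rather than from the data, so a much smaller number of centers would not hit this particular error.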

4 Answers

14

The bigmemory package (now a family of packages; see their website) uses k-means as a running example of extended analytics on large data. See in particular the sub-package biganalytics, which contains the k-means function bigkmeans.
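
A minimal sketch of that workflow, assuming the CSV holds only numeric cells and no header. The toy data, file names, and the choice of 3 centers are all made up for illustration; the real 500000 x 4000 matrix would be read the same way from its own CSV.

```r
library(bigmemory)
library(biganalytics)

# Hypothetical toy stand-in for the real CSV: 200 rows x 50 columns of 0/1.
set.seed(1)
csv_path <- tempfile(fileext = ".csv")
toy <- matrix(rbinom(200 * 50, 1, 0.2), nrow = 200)
write.table(toy, csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

# File-backed big.matrix: the values are stored in a backing file on disk.
# The backing and descriptor file names are placeholders.
bf <- basename(tempfile("toy_", fileext = ".bin"))
x <- read.big.matrix(csv_path, sep = ",", header = FALSE, type = "double",
                     backingpath = tempdir(),
                     backingfile = bf,
                     descriptorfile = paste0(bf, ".desc"))

# k-means directly on the big.matrix.
clust <- bigkmeans(x, centers = 3, iter.max = 20)
table(clust$cluster)
```

Note that a big.matrix is dense, so the backing file grows with rows x columns regardless of how many cells are zero.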

Dirk Eddelbuettel
1

sparcl performs sparse hierarchical clustering and sparse k-means clustering. This should work well for matrices that R can handle, i.e. those that fit into memory.

http://cran.r-project.org/web/packages/sparcl/sparcl.pdf

==

For really big matrices, I would try a solution based on Apache Spark sparse matrices and MLlib, although I do not know how experimental it is at the moment:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$

https://spark.apache.org/docs/latest/mllib-clustering.html

MichalO
1

Please check:

library(foreign)
?read.arff

Cheers.

joran
0

The SparseM package for R can store such a matrix efficiently. If that doesn't work, I would try moving to a higher-performance language, such as C.
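
To illustrate what sparse storage means here, below is a small sketch. It uses the Matrix package rather than SparseM (a swap on my part, chosen because Matrix is a recommended package bundled with most R installations), and the triplet data are hypothetical toy values standing in for the positions of the "1" entries.

```r
library(Matrix)

# Hypothetical toy data: (row, column) positions of the "1" entries.
i <- c(1, 1, 2, 3, 3, 4)
j <- c(2, 5, 1, 3, 5, 4)

# Only the listed positions are stored; every other cell is an implicit zero.
m <- sparseMatrix(i = i, j = j, x = 1, dims = c(4, 5))

nnzero(m)        # number of stored nonzero entries
m[1, 2]          # a stored entry
m[2, 2]          # an implicit zero
```

With only a couple of nonzeros per row, storage of this kind scales with the number of "1" values rather than with rows x columns.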

Olga Mu