
I need to run Principal Component Analysis and K-means clustering on a fairly large dataset (around 10 GB) that is spread across many files. I want to use Apache Spark for this since it's known to be fast and distributed.

I know that Spark supports PCA, and also PCA followed by K-means.

However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.

Edward J. Stembler
  • You can read your files via `textFile`, see [this question](http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files), and then process the resulting RDD as you wish – Odomontois Jun 10 '15 at 14:23
  • @Odomontois Yeah, I also found [that question](http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files) shortly after posting mine. I wasn't sure if that was the idiomatic way of doing it in Spark. Have you done that before in Spark? – Edward J. Stembler Jun 10 '15 at 14:27
  • I write code like `sc.textFile(files mkString ",")` pretty often – Odomontois Jun 10 '15 at 14:35
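
For reference, below is a minimal sketch of how this could look with the RDD-based MLlib API (Spark 1.x, Scala), building on the `sc.textFile(files mkString ",")` approach from the comments. The file paths, the number of principal components (10), and the number of clusters (5) are placeholders, and the input is assumed to be plain-text files with one comma-separated row of numeric features per line.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.clustering.KMeans

object PcaKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pca-kmeans"))

    // textFile takes a comma-separated list of paths (globs work too),
    // so all files end up in a single distributed RDD.
    val files = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001") // placeholder paths
    val rows = sc.textFile(files.mkString(","))
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // PCA: compute the top 10 principal components and project every row onto them.
    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(10)   // local numFeatures x 10 matrix
    val projected = mat.multiply(pc).rows.cache() // distributed RDD of projected rows

    // K-means on the projected data (5 clusters, 20 iterations as placeholders).
    val model = KMeans.train(projected, 5, 20)
    println(s"WSSSE: ${model.computeCost(projected)}")

    sc.stop()
  }
}
```

Note that `textFile` also accepts glob patterns (e.g. `hdfs:///data/part-*`), so the files don't have to be listed explicitly; either way the read, the PCA projection, and the clustering all run distributed across the cluster.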

0 Answers