
I need to run Principal Component Analysis and K-means clustering on a fairly large dataset (around 10 GB) that is spread across many files. I want to use Apache Spark for this since it's known to be fast and distributed.

I know that Spark supports PCA, and also PCA followed by K-means.

However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.

Edward J. Stembler
  • You can read your files via `textFile`, see [this question](http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files), and then process the resulting RDD as you wish – Odomontois Jun 10 '15 at 14:23
  • @Odomontois Yeah, I also found [that question](http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files) shortly after posting mine. I wasn't sure if that was the idiomatic way of doing it in Spark. Have you done that before in Spark? – Edward J. Stembler Jun 10 '15 at 14:27
  • I write code like `sc.textFile(files mkString ",")` pretty often – Odomontois Jun 10 '15 at 14:35
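
For reference, below is a minimal sketch of how this could look with the RDD-based MLlib API (Spark 1.x, Scala), building on the `sc.textFile(files mkString ",")` approach from the comments. The file paths, the number of principal components (10), and the number of clusters (5) are placeholders, and the input is assumed to be plain-text files with one comma-separated row of numeric features per line.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.clustering.KMeans

object PcaKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pca-kmeans"))

    // textFile takes a comma-separated list of paths (globs work too),
    // so all files end up in a single distributed RDD.
    val files = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001") // placeholder paths
    val rows = sc.textFile(files.mkString(","))
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // PCA: compute the top 10 principal components and project every row onto them.
    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(10)   // local numFeatures x 10 matrix
    val projected = mat.multiply(pc).rows.cache() // distributed RDD of projected rows

    // K-means on the projected data (5 clusters, 20 iterations as placeholders).
    val model = KMeans.train(projected, 5, 20)
    println(s"WSSSE: ${model.computeCost(projected)}")

    sc.stop()
  }
}
```

Note that `textFile` also accepts glob patterns (e.g. `hdfs:///data/part-*`), so the files don't have to be listed explicitly; either way the read, the PCA projection, and the clustering all run distributed across the cluster.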

0 Answers