I'm pretty new to Scala and Spark, and I can't work out how to create a correlation matrix from a file of ratings. It's similar to this question, but my data is sparse and stored in matrix form, one row per user. My data looks like this:
<user-id>, <rating-for-movie-1-or-null>, ... <rating-for-movie-n-or-null>
123, , , 3, , 4.5
456, 1, 2, 3, , 4
...
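To make the format concrete, here is a minimal sketch of how one row could be parsed in plain Scala. It assumes that a blank field means "no rating" and that representing it as NaN is acceptable; `parseRow` is a hypothetical helper name, not something provided by Spark.

```scala
// Hypothetical helper (not part of Spark): parses one line of the ratings file
// into the user id and that user's ratings, with NaN marking a missing rating.
def parseRow(line: String): (String, Array[Double]) = {
  // limit = -1 keeps trailing empty fields, which plain split(",") would drop
  val fields = line.split(",", -1).map(_.trim)
  val ratings = fields.tail.map(f => if (f.isEmpty) Double.NaN else f.toDouble)
  (fields.head, ratings)
}

// First sample row from above
val (userId, ratings) = parseRow("123, , , 3, , 4.5")
```

Keeping the user id separate from the ratings would also keep it out of any later correlation step.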
The code that is most promising so far looks like this:
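import org.apache.spark.mllib.stat.Statistics
val corTest = sc.textFile("data/collab_filter_data.txt").map(_.split(","))
// Does not compile: corTest is an RDD[Array[String]], but Statistics.corr expects an RDD[Vector]
Statistics.corr(corTest, "pearson")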
(I know that leaving the user IDs in as the first column is a defect, but I'm willing to live with that for the moment.)
I'm expecting output like:
1, .123, .345
.123, 1, .454
.345, .454, 1
It's a matrix showing how each user is correlated with every other user. Graphically, it would be a correlogram.
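As a rough illustration of the kind of matrix described above, the following plain-Scala sketch computes a user-by-user Pearson correlation using only the movies that both users have rated. This is one possible way to treat the missing ratings, not necessarily what `Statistics.corr` does, and the `pearson` helper and the third user's ratings are made up for the example.

```scala
// Hypothetical helper: Pearson correlation over the positions where both users have a rating.
def pearson(a: Array[Double], b: Array[Double]): Double = {
  val pairs = a.zip(b).filter { case (x, y) => !x.isNaN && !y.isNaN }
  val n = pairs.length
  if (n < 2) Double.NaN // not enough co-rated movies to define a correlation
  else {
    val meanA = pairs.map(_._1).sum / n
    val meanB = pairs.map(_._2).sum / n
    val cov = pairs.map { case (x, y) => (x - meanA) * (y - meanB) }.sum
    val varA = pairs.map { case (x, _) => math.pow(x - meanA, 2) }.sum
    val varB = pairs.map { case (_, y) => math.pow(y - meanB, 2) }.sum
    cov / math.sqrt(varA * varB) // NaN if either user's co-rated ratings are constant
  }
}

val nan = Double.NaN
val users = Array(
  Array(nan, nan, 3.0, nan, 4.5), // user 123 from the sample data
  Array(1.0, 2.0, 3.0, nan, 4.0), // user 456 from the sample data
  Array(2.0, nan, 1.0, 5.0, 3.0)  // made-up third user, for illustration only
)

// corrMatrix(i)(j) is the correlation between user i and user j
val corrMatrix = users.map(u => users.map(v => pearson(u, v)))
```

Note that users 123 and 456 share only two rated movies, so their correlation comes out as exactly 1.0; with data this sparse, many entries will be similarly uninformative or undefined.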
It's a total noob problem, but I've been fighting with it for a few hours and can't seem to Google my way out of it.