I have a DataFrame of user ratings (from 1 to 5) for movies. To get a DataFrame where the first column is the movie id and the remaining columns are the ratings for that movie by each user, I do the following:
val ratingsPerMovieDF = imdbRatingsDF
.groupBy("imdbId")
.pivot("userId")
.max("rating")
Now, here I get a DataFrame where most of the values are null, since most users have rated only a few movies.
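For reference, this is just how I've been sanity-checking the pivoted result (the columns after imdbId are the user ids):
// one row per movie, one rating column per user, mostly null
ratingsPerMovieDF.printSchema()
ratingsPerMovieDF.show(5, truncate = false)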
I'm interested in calculating similarities between those movies (item-based collaborative filtering).
I was trying to assemble a RowMatrix (for further similarity calculations with mllib) from the rating column values. However, I don't know how to deal with the null values.
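For context, this is the mllib call I was planning to use once I have the RowMatrix (just a sketch with a hypothetical helper; as far as I understand, columnSimilarities() computes cosine similarities between columns, so whether the movies end up as rows or columns matters):
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, RowMatrix}
// sketch only: returns pairwise cosine similarities between the *columns*
// of the matrix, so movie-movie similarities need movies to be the columns
def movieSimilarities(mat: RowMatrix): CoordinateMatrix =
  mat.columnSimilarities()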
Here is the code where I try to get a Vector for each row:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(ratingsPerMovieDF.columns.filter(_ != "imdbId")) // all per-user rating columns
.setOutputCol("ratings")
val ratingsDF = assembler.transform(ratingsPerMovieDF).select("imdbId", "ratings")
It gives me an error:
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
I could substitute the nulls with 0s using .na.fill(0), but that would produce incorrect correlation results, since almost all of the Vectors would become very similar.
Can anyone suggest what to do in this case? The end goal here is to calculate correlations between rows. I was thinking of using SparseVectors somehow (to ignore the null values), but I don't know how.
I'm new to Spark and Scala, so some of this might make little sense. I'm just trying to understand things better.