I'm doing a recommandation system for movies, using the MovieLens datasets available here : http://grouplens.org/datasets/movielens/
To compute this recommandation system, I use the ML library of Flink in scala, and particulalrly the ALS algorithm (org.apache.flink.ml.recommendation.ALS
).
I first map the ratings of the movie into a DataSet[(Int, Int, Double)]
and then create a trainingSet
and a testSet
(see the code below).
My problem is that there is no bug when I'm using the ALS.fit
function with the whole dataset (all the ratings), but if I just remove only one rating, the fit function doesn't work anymore, and I don't understand why.
Do You have any ideas? :)
Code used :
Rating.scala
case class Rating(userId: Int, movieId: Int, rating: Double)
PreProcessing.scala
object PreProcessing {
def getRatings(env : ExecutionEnvironment, ratingsPath : String): DataSet[Rating] = {
env.readCsvFile[(Int, Int, Double)](
ratingsPath, ignoreFirstLine = true,
includedFields = Array(0,1,2)).map{r => new Rating(r._1, r._2, r._3)}
}
Processing.scala
object Processing {
private val ratingsPath: String = "Path_to_ratings.csv"
def main(args: Array[String]) {
val env = ExecutionEnvironment.getExecutionEnvironment
val ratings: DataSet[Rating] = PreProcessing.getRatings(env, ratingsPath)
val trainingSet : DataSet[(Int, Int, Double)] =
ratings
.map(r => (r.userId, r.movieId, r.rating))
.sortPartition(0, Order.ASCENDING)
.first(ratings.count().toInt)
val als = ALS()
.setIterations(10)
.setNumFactors(10)
.setBlocks(150)
.setTemporaryPath("/tmp/tmpALS")
val parameters = ParameterMap()
.add(ALS.Lambda, 0.01) // After some tests, this value seems to fit the problem
.add(ALS.Seed, 42L)
als.fit(trainingSet, parameters)
}
}
"But if I just remove only one rating"
val trainingSet : DataSet[(Int, Int, Double)] =
ratings
.map(r => (r.userId, r.movieId, r.rating))
.sortPartition(0, Order.ASCENDING)
.first((ratings.count()-1).toInt)
The error :
06/19/2015 15:00:24 CoGroup (CoGroup at org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:570))(4/4) switched to FAILED
java.lang.ArrayIndexOutOfBoundsException: 5
at org.apache.flink.ml.recommendation.ALS$BlockRating.apply(ALS.scala:358)
at org.apache.flink.ml.recommendation.ALS$$anon$111.coGroup(ALS.scala:635)
at org.apache.flink.runtime.operators.CoGroupDriver.run(CoGroupDriver.java:152)
...