
My question is equivalent to the R-related post Create Sparse Matrix from a data frame, except that I would like to do the same thing on Spark (preferably in Scala).

A sample of the data in the data.txt file from which the sparse matrix is to be created:

UserID MovieID  Rating
2      1       1
3      2       1
4      2       1
6      2       1
7      2       1

So in the end, the columns are the movie IDs and the rows are the user IDs:

    1   2   3   4   5   6   7
1   0   0   0   0   0   0   0
2   1   0   0   0   0   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   0   0   0
5   0   0   0   0   0   0   0
6   0   1   0   0   0   0   0
7   0   1   0   0   0   0   0

I've actually started with a map transformation on the data.txt RDD (with the header removed) to convert the values to Int, but then ... I could not find a function for creating a sparse matrix.

import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/data/data.txt")
// header line already removed; the sample above is whitespace-separated
// (use split(',') instead if the file is actually comma-separated)
val ratings = data.map(_.split("\\s+") match { case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toInt)
  })
...?
guzu92

1 Answer


The simplest way is to map Ratings to MatrixEntries and create a CoordinateMatrix:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// one MatrixEntry per rating: row index = user id, column index = movie id
val mat = new CoordinateMatrix(ratings.map {
    case Rating(user, movie, rating) => MatrixEntry(user, movie, rating)
})

CoordinateMatrix can be further converted to BlockMatrix, IndexedRowMatrix, RowMatrix using toBlockMatrix, toIndexedRowMatrix, toRowMatrix respectively.
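
For reference, a minimal sketch of those conversions, assuming the mat built above (the local val names are just placeholders):

val indexedRows = mat.toIndexedRowMatrix() // keeps the row indices (user ids)
val rows        = mat.toRowMatrix()        // drops the row indices
val blocks      = mat.toBlockMatrix()      // distributed blocks, e.g. for matrix multiplication

// Dimensions are derived from the largest indices seen (indices are 0-based),
// so numRows/numCols reflect the maximum user and movie ids + 1.
println(s"${mat.numRows()} x ${mat.numCols()}")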

zero323
  • I am doing something similar with Python: `mat = CoordinateMatrix(groupped_df.rdd.map(lambda r: MatrixEntry(r.userId, r.itemId, r.rating)))`. But this does not account for duplicate items. So, my unique users are 4300, my unique items are 2800, and in the end, I get a Coordinate Matrix that is 18300 by 90200. So, is this a normal behavior, and how can I remove duplicates? – Dimitris Poulopoulos Jun 29 '17 at 07:30
    @JimVer I'd say it depends on the business logic. If you have duplicate ratings for the same user, it probably makes sense to summarize (average, median, take the most recent one) your data first. – zero323 Jun 29 '17 at 20:31
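
To illustrate that suggestion, a hedged Scala sketch that averages duplicate (user, movie) pairs before building the matrix, assuming the ratings RDD from the question and the imports from the answer (the val names are illustrative):

// Average duplicate (user, movie) ratings before building the CoordinateMatrix
// (swap in a different summary, e.g. median or most recent, if your business logic requires it).
val deduplicated = ratings
  .map(r => ((r.user, r.product), (r.rating, 1L)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // (sum of ratings, count)
  .map { case ((user, movie), (sum, count)) => MatrixEntry(user, movie, sum / count) }

val dedupedMat = new CoordinateMatrix(deduplicated)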