
My question is equivalent to the R-related post Create Sparse Matrix from a data frame, except that I would like to do the same thing on Spark (preferably in Scala).

A sample of the data in the data.txt file from which the sparse matrix is to be created:

UserID MovieID  Rating
2      1       1
3      2       1
4      2       1
6      2       1
7      2       1

So in the end, the columns are the movie IDs and the rows are the user IDs:

    1   2   3   4   5   6   7
1   0   0   0   0   0   0   0
2   1   0   0   0   0   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   0   0   0
5   0   0   0   0   0   0   0
6   0   1   0   0   0   0   0
7   0   1   0   0   0   0   0

I've actually started with a map transformation on the data.txt RDD (with the header removed) to convert the values to Int, but then ... I could not find a function for creating a sparse matrix.

import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/data/data.txt")
// header line already removed; the sample above is whitespace-separated
// (use split(',') instead if the file is actually comma-separated)
val ratings = data.map(_.split("\\s+") match { case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toInt)
  })
...?
guzu92

1 Answer


The simplest way is to map Ratings to MatrixEntries and create a CoordinateMatrix:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// one MatrixEntry per rating: row index = user id, column index = movie id
val mat = new CoordinateMatrix(ratings.map {
    case Rating(user, movie, rating) => MatrixEntry(user, movie, rating)
})

CoordinateMatrix can be further converted to BlockMatrix, IndexedRowMatrix, RowMatrix using toBlockMatrix, toIndexedRowMatrix, toRowMatrix respectively.
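
For reference, a minimal sketch of those conversions, assuming the mat built above (the local val names are just placeholders):

val indexedRows = mat.toIndexedRowMatrix() // keeps the row indices (user ids)
val rows        = mat.toRowMatrix()        // drops the row indices
val blocks      = mat.toBlockMatrix()      // distributed blocks, e.g. for matrix multiplication

// Dimensions are derived from the largest indices seen (indices are 0-based),
// so numRows/numCols reflect the maximum user and movie ids + 1.
println(s"${mat.numRows()} x ${mat.numCols()}")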

zero323
  • I am doing something similar with Python: `mat = CoordinateMatrix(groupped_df.rdd.map(lambda r: MatrixEntry(r.userId, r.itemId, r.rating)))`. But this does not account for duplicate items. So, my unique users are 4300, my unique items are 2800, and in the end, I get a Coordinate Matrix that is 18300 by 90200. So, is this a normal behavior, and how can I remove duplicates? – Dimitris Poulopoulos Jun 29 '17 at 07:30
    @JimVer I'd say it depends on the business logic. If you have duplicate ratings for the same user, it probably makes sense to summarize (average, median, take the most recent one) your data first. – zero323 Jun 29 '17 at 20:31
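
To illustrate that suggestion, a hedged Scala sketch that averages duplicate (user, movie) pairs before building the matrix, assuming the ratings RDD from the question and the imports from the answer (the val names are illustrative):

// Average duplicate (user, movie) ratings before building the CoordinateMatrix
// (swap in a different summary, e.g. median or most recent, if your business logic requires it).
val deduplicated = ratings
  .map(r => ((r.user, r.product), (r.rating, 1L)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // (sum of ratings, count)
  .map { case ((user, movie), (sum, count)) => MatrixEntry(user, movie, sum / count) }

val dedupedMat = new CoordinateMatrix(deduplicated)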