
I have a CSV dataset that I want to process using Spark. The second column has this format:

yyyy-MM-dd hh:mm:ss

I want to group the rows by their MM-dd part:

val days: RDD[String] = sc.textFile(<csv file>)

val partitioned = days.map { row =>
  row.split(",")(1).substring(5, 10)
}.invertTheMap.groupOrReduceByKey

The result of `groupOrReduceByKey` should be of the form:

("MM-dd" -> (row1, row2, row3, ..., row_n) )

How should I implement `invertTheMap` and `groupOrReduceByKey`?

I saw this done in Python here, but I wonder how it is done in Scala?


1 Answer


This should do the trick:

val testData = List("a, 1987-09-30",
  "a, 2001-09-29",
  "b, 2002-09-30")

val input = sc.parallelize(testData)

val grouped = input.map { row =>
  val columns = row.split(",")

  // columns(1) keeps the leading space after the comma,
  // so the MM-dd part sits at indices 6 to 10
  (columns(1).substring(6, 11), row)
}.groupByKey()

// In local mode this prints on the driver; on a cluster, use
// grouped.collect().foreach(println) to see the output locally.
grouped.foreach(println)

The output is

(09-29,CompactBuffer(a, 2001-09-29))
(09-30,CompactBuffer(a, 1987-09-30, b, 2002-09-30))
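Since the question mentions groupOrReduceByKey: if you only need an aggregate per day instead of the full row groups, `reduceByKey` is usually preferable to `groupByKey`, because it combines values map-side before shuffling. A minimal sketch counting rows per MM-dd with the same test data (the `counts` name is illustrative, not from the answer):

val counts = input.map { row =>
  val columns = row.split(",")

  // same key extraction as above, but emit a 1 per row
  (columns(1).substring(6, 11), 1)
}.reduceByKey(_ + _)

counts.collect().foreach(println)
// e.g. (09-29,1) and (09-30,2); the ordering is not guaranteed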
Till Rohrmann
  • can you share how to reference the CompactBuffer using the key in a general way and convert the CompactBuffer to an RDD? Thanks. – oikonomiyaki Oct 22 '15 at 14:31
  • `groupByKey` will return an `RDD[(Key, Iterable[Value])]` where `Key` is `String` and `Value` is `String` as well in my example. From there you can simply continue with your computations. – Till Rohrmann Oct 22 '15 at 14:44
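For the follow-up question in the comments, here is a minimal sketch of pulling one day's rows out by key, assuming the `grouped` RDD from the answer (the key "09-30" and the value names are illustrative):

// grouped is an RDD[(String, Iterable[String])], so lookup("09-30")
// returns a Seq containing that key's Iterable (the CompactBuffer);
// flatten unwraps it into the individual rows.
val rowsForDay: Seq[String] = grouped.lookup("09-30").flatten

// Re-distribute the rows as an RDD if further parallel processing is needed.
val rowsForDayRDD = sc.parallelize(rowsForDay)

Note that `lookup` materializes the values on the driver; for very large groups, `grouped.filter(_._1 == "09-30")` keeps the data distributed instead.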