I need to group my RDD by two columns and count the records in each group. I have this function:

def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic])
: RDD[FeatureTuple] = {

    val grouped_patients = diagnostic
      .groupBy(x => (x.patientID, x.code))
      .map(_._2)
      .map{ events =>
        val p_id = events.map(_.patientID).take(1).mkString
        val f_code = events.map(_.code).take(1).mkString
        val count = events.size.toDouble
        ((p_id, f_code), count)
      }
    //should be in form:
    //diagnostic.sparkContext.parallelize(List((("patient", "diagnostics"), 1.0)))
}

At compile time, I am getting an error:

/FeatureConstruction.scala:38:3: type mismatch;
[error]  found   : Unit
[error]  required: org.apache.spark.rdd.RDD[edu.gatech.cse6250.features.FeatureConstruction.FeatureTuple]
[error]     (which expands to)  org.apache.spark.rdd.RDD[((String, String), Double)]
[error]   }
[error]   ^ 

How can I fix it? I read this post: Scala Spark type missmatch found Unit, required rdd.RDD, but I do not use collect(), so it does not help me.

  • 2
    The body of your method ends with the definition `val grouped_patients = diagnostic...`, and a block that ends with a **val** definition has type `Unit`. Either omit the `val grouped_patients =` part and leave only the expression, or return the val on the last line. - BTW, why do you discard the key with `map(_._2)` if you need it later? Why not just `map { case ((p_id, f_code), events) => ((p_id, f_code), events.size.toDouble) }`? - Also, keep in mind that `groupBy` is expensive: if you have many records with the same key, it may throw an **OutOfMemoryError**. Take a look at `reduceByKey`, which is available on key-value RDDs (**PairRDDFunctions**). – Luis Miguel Mejía Suárez Feb 12 '19 at 20:50
  • 3
    As @LuisMiguelMejíaSuárez already pointed out, you assign to `grouped_patients`, but you never return `grouped_patients`. Does [this](https://stackoverflow.com/a/12560532/2707792) answer your question? – Andrey Tyukin Feb 12 '19 at 21:00
  • Thank you, @LuisMiguelMejíaSuárez and @Andrey! Yes, omitting "val grouped_patients =" works! – Ekaterina Tcareva Feb 12 '19 at 21:32
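
For later readers, the suggestions in the comments can be combined into a minimal sketch. It makes the aggregation the last expression of the method, so the method returns an RDD rather than `Unit`, and it counts with `reduceByKey` instead of `groupBy`. The `Diagnostic` case class and the `FeatureTuple` alias below are hypothetical stand-ins (the real definitions are not shown in the question and probably have more fields), and the object name and local `SparkSession` setup are assumptions for illustration only.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object FeatureConstructionSketch {
  // Hypothetical stand-ins for the types used in the question.
  case class Diagnostic(patientID: String, code: String)
  type FeatureTuple = ((String, String), Double)

  // The method body is a single expression, so its value (an RDD) is what gets returned.
  def constructDiagnosticFeatureTuple(diagnostic: RDD[Diagnostic]): RDD[FeatureTuple] =
    diagnostic
      .map(d => ((d.patientID, d.code), 1.0)) // one count per event, keyed by (patient, code)
      .reduceByKey(_ + _)                     // sum the counts per key without materializing groups

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch out; a real job would get its context elsewhere.
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    try {
      val events = spark.sparkContext.parallelize(Seq(
        Diagnostic("p1", "A"), Diagnostic("p1", "A"), Diagnostic("p2", "B")))
      constructDiagnosticFeatureTuple(events).collect().foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

Keeping `val grouped_patients = ...` also compiles, as long as `grouped_patients` is written on its own as the last line of the method body.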

0 Answers