import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Test").master("local").getOrCreate()
import spark.implicits._

spark.sparkContext.parallelize(List(("50", 2L), ("34", 4L))).toDF("cs_p_id", "cs_ed")
  .groupByKey(_.getAs[String]("cs_p_id"))
  .reduceGroups((a, b) => Seq(a, b).maxBy(_.getAs[Long]("cs_ed")))
  .map(_._2) // Unable to find encoder for type stored in a Dataset.

The above won't compile because `map` can't find an implicit `Encoder[Row]`.
Surely I'm not the only one trying to do this simple operation, so what's the right way to go?
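
As far as I can tell, going through a typed Dataset of a case class sidesteps the problem, since `spark.implicits._` derives the encoder automatically. A minimal sketch (the case class `Record` and the sample data are mine):

case class Record(cs_p_id: String, cs_ed: Long)

val ds = spark.sparkContext
  .parallelize(List(Record("50", 2L), Record("34", 4L)))
  .toDS()

ds.groupByKey(_.cs_p_id)
  .reduceGroups((a, b) => if (a.cs_ed >= b.cs_ed) a else b) // keep the row with the max cs_ed
  .map(_._2) // compiles: Encoder[Record] is derived

But defining a case class for every table I touch feels heavyweight for a one-off dedup.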

Thanks

EDIT:
I found the following workaround, and I can't believe people are actually doing it this way:

import org.apache.spark.sql.catalyst.encoders.RowEncoder

tableData
  .groupByKey(_.getAs[String]("cs_p_id"))
  .reduceGroups((a, b) => Seq(a, b).maxBy(_.getAs[Long]("cs_ed")))
  .map(_._2)(RowEncoder(tableData.schema)) // pass the Row encoder explicitly

This is not a duplicate of [Encoder error while trying to map dataframe row to updated row](https://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row), since all I'm trying to do is remove duplicates.
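
I suppose a window function would also work for plain deduplication without ever needing an encoder, though it feels like yet another workaround. A sketch, assuming `tableData` as above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// keep the row with the largest cs_ed for each cs_p_id
val byKey = Window.partitionBy("cs_p_id").orderBy(col("cs_ed").desc)

tableData
  .withColumn("rn", row_number().over(byKey))
  .where(col("rn") === 1)
  .drop("rn")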

Joan
  • There isn't enough code here to reproduce this, but assuming your local SparkSession instance is bound to the variable `spark`, try adding `import spark.implicits._` in the body of your function. – Terry Dactyl Oct 31 '18 at 14:16
  • Thanks @TerryDactyl for your help. This import is already in scope but it doesn't help. – Joan Oct 31 '18 at 14:19
  • Can you post a bit more context? – Terry Dactyl Oct 31 '18 at 14:20
  • See edit with a complete example. Thanks – Joan Oct 31 '18 at 14:29
  • Possible duplicate of [Encoder error while trying to map dataframe row to updated row](https://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row) – 10465355 Oct 31 '18 at 16:01
  • There must be a way to do this without going through this RowEncoder madness – Joan Oct 31 '18 at 16:12

0 Answers