I need to select the last 'name' for each 'id', ordered by 'start_time'. A possible solution could be the following:
val channels = sessions
.select($"start_time", $"id", $"name")
.orderBy($"start_time")
.select($"id", $"name")
.groupBy($"id")
.agg(last("name"))
I don't know if it's correct, because I'm not sure that the ordering from orderBy is preserved after groupBy.
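To make the semantics I want concrete, here is a plain-Scala sketch (no Spark, made-up sample values) of "for each id, the name of the row with the greatest start_time":

```scala
case class Row(startTime: Long, id: Long, name: String)

// Made-up sample data; the real rows come from the `sessions` DataFrame.
val rows = Seq(
  Row(1L, 1L, "T1"),
  Row(2L, 2L, "T2"),
  Row(3L, 1L, "T100")
)

// For each id, keep the name of the row with the greatest start_time.
val lastNamePerId: Map[Long, String] =
  rows.groupBy(_.id).map { case (id, rs) => id -> rs.maxBy(_.startTime).name }
```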
But it's certainly not a performant solution. Probably I should use reduceByKey. I tried the following in the spark shell and it works:
val x = sc.parallelize(Array(("1", "T1"), ("2", "T2"), ("1", "T11"), ("1", "T111"), ("2", "T22"), ("1", "T100"), ("2", "T222"), ("2", "T200")), 3)
x.reduceByKey((acc,x) => x).collect
But it doesn't work with my DataFrame:
case class ChannelRecord(id: Long, name: String)
val channels = sessions
.select($"start_time", $"id", $"name")
.orderBy($"start_time")
.select($"id", $"name")
.as[ChannelRecord]
.reduceByKey((acc, x) => x) // take the last object
I get a compilation error: value reduceByKey is not a member of org.apache.spark.sql.Dataset
I think I should add a map() call before reduceByKey, but I cannot figure out what I should map to.
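The map I have in mind would key each record by its id and then reduce by keeping the last value. A plain-Scala sketch of that pairing-plus-fold (not Spark's actual reduceByKey; note that on a real RDD the partitioning may not preserve encounter order, which is part of what I'm unsure about):

```scala
case class ChannelRecord(id: Long, name: String)

// Made-up sample in the order the rows would arrive after sorting by start_time.
val records = Seq(
  ChannelRecord(1L, "T1"), ChannelRecord(2L, "T2"),
  ChannelRecord(1L, "T11"), ChannelRecord(1L, "T100"),
  ChannelRecord(2L, "T200")
)

// Key each record by id (the `map` step), then fold keeping the last value
// per key, mirroring reduceByKey((acc, x) => x) over an ordered sequence.
val lastPerId: Map[Long, String] = records
  .map(r => r.id -> r.name)
  .foldLeft(Map.empty[Long, String]) { case (acc, (id, name)) => acc.updated(id, name) }
```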