I have a Dataset on which I call groupBy("_1","_2","_3","_4").agg(max("_5").as("time"), collect_list("_6").as("value")).
This returns a Dataset grouped by the four columns, with the max of the
time column and a collect_list that holds every value in the group, e.g. [5, 1].
But what I want for _6 is only the single value from the row that matches
the grouped columns and has max("_5").as("time"), not all values in the group.
Code below:
val data = Seq(("thing1",1,1,"Temperature",1551501300000L,"5"),("thing1",1,1,"Temperature",1551502200000L,"1"))
import org.apache.spark.sql.functions._
val dataSet = spark.sparkContext.parallelize(data)
import spark.implicits._
val testDS = dataSet.toDS()
testDS.groupBy("_1","_2","_3","_4").agg(max("_5").as("time"),collect_list("_6").as("value")).show()
Output:
| _1     | _2 | _3 | _4          | time          | value  |
| thing1 | 1  | 1  | Temperature | 1551502200000 | [5, 1] |
Required Output:
| _1     | _2 | _3 | _4          | time          | value |
| thing1 | 1  | 1  | Temperature | 1551502200000 | 1     |
I don't want 5 in the value column, since that row does not satisfy the max("time") criterion.
All I need is 1 in the value column, because only that row matches all the grouped columns and has the max("time").
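For reference, here is a sketch of the struct-based "argmax" pattern I have come across (I'm not sure it is the idiomatic way): it relies on max over a struct comparing the struct's fields left to right, so putting the time column first means the struct with the latest time wins, and its second field is the matching value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("argmax-sketch").getOrCreate()
import spark.implicits._

val data = Seq(
  ("thing1", 1, 1, "Temperature", 1551501300000L, "5"),
  ("thing1", 1, 1, "Temperature", 1551502200000L, "1")
)
val testDS = data.toDS()

// max over a struct compares its fields left to right, so the struct whose
// _5 (time) is largest wins; its _6 is the value recorded at that time
val result = testDS
  .groupBy("_1", "_2", "_3", "_4")
  .agg(max(struct($"_5", $"_6")).as("mx"))
  .select($"_1", $"_2", $"_3", $"_4", $"mx._5".as("time"), $"mx._6".as("value"))

result.show()
```

This keeps everything in a single aggregation, avoiding a separate join back against the grouped maximum.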
How can I achieve this?
Thank you.