I have a DataFrame with an array column.
val json = """[
{"id": 1, "value": [11, 12, 18]},
{"id": 2, "value": [23, 21, 29]}
]"""
val df = spark.read.json(Seq(json).toDS)
scala> df.show
+---+------------+
| id| value|
+---+------------+
| 1|[11, 12, 18]|
| 2|[23, 21, 29]|
+---+------------+
Now I need to apply different aggregate functions to the value column.
I can call explode and groupBy, for example:
df.select($"id", explode($"value").as("value")).groupBy($"id").agg(max("value"), avg("value")).show
+---+----------+------------------+
| id|max(value)| avg(value)|
+---+----------+------------------+
| 1| 18|13.666666666666666|
| 2| 29|24.333333333333332|
+---+----------+------------------+
What bothers me here is that I explode my DataFrame into a bigger one and then reduce it back to its original size by calling groupBy.
Is there a better (i.e. more efficient) way to apply aggregate functions to an array column? I could probably implement a UDF, but I don't want to write all the aggregation UDFs myself.
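For reference, this is roughly what a single hand-rolled UDF would look like (a sketch only; udfAvg is just an illustrative name). I'd need one of these per aggregation, which is exactly what I'm trying to avoid:
import org.apache.spark.sql.functions.udf

// Per-row average over the array elements; a separate UDF like this would be
// needed for every aggregation (max, min, sum, ...).
val udfAvg = udf((xs: Seq[Long]) => xs.sum.toDouble / xs.size)

df.select($"id", udfAvg($"value").as("avg(value)")).show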
EDIT: Someone referenced this SO question, but it doesn't work in my case.
The size function works fine:
scala> df.select($"id", size($"value")).show
+---+-----------+
| id|size(value)|
+---+-----------+
| 1| 3|
| 2| 3|
+---+-----------+
But avg or max do not work.
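To be concrete, this is the kind of call I would like to make without the explode step; avg rejects the array<bigint> column with a data type mismatch, and max, where it is accepted at all, compares whole arrays rather than aggregating their elements (the exact behaviour seems to depend on the Spark version):
// Does not do what I want: avg fails analysis on array<bigint>, and max
// never gives an element-wise maximum.
df.groupBy($"id").agg(max($"value"), avg($"value")).show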