Spark: use value of a groupBy column as a name for an aggregate column

Question

I want to give aggregate column name which contains a value of one of the groupBy columns:

dataset
   .groupBy("user", "action")
   .agg(collect_list("timestamp").name($"action" + "timestamps")

this part: .name($"action") does not work because name expects a String, not a Column.

are you looking for pivot? `groupBy` can return multiple distinct value for `action` — undefined_variable, Feb 10 '20 at 12:44
@ArtemAliev I don't think it's a duplicate — it's not obvious to a typical Spark beginner that pivot can be used to solve the column renaming problem. — ptkvsk, Feb 10 '20 at 15:09
This question doesn't deserve a down-vote but it's definitely a dupe. — eliasah, Feb 10 '20 at 17:59

score 1 · Accepted Answer · answered Feb 10 '20 at 13:53

Base on: How to pivot Spark DataFrame?

val df = spark.createDataFrame(Seq(("U1","a",1), ("U2","b",2))).toDF("user", "action", "timestamp")

val res = df.groupBy("user", "action").pivot("action").agg(collect_list("timestamp"))
res.show()

+----+------+---+---+
|user|action|  a|  b|
+----+------+---+---+ 
|  U1|     a|[1]| []|
|  U2|     b| []|[2]|
+----+------+---+---+

Fun part with column renaming. We should rename all but first 2 columns

val renames = res.schema.names.drop(2).map (n => col(n).as(n + "_timestamp"))
res.select((col("user") +: renames): _*).show


+----+-----------+-----------+
|user|a_timestamp|b_timestamp|
+----+-----------+-----------+
|  U1|        [1]|         []|
|  U2|         []|        [2]|
+----+-----------+-----------+

Thanks! Your renaming part is pretty complicated but nice. I went with a simpler approach via `withColumnRenamed("a", "a_timestamp").withColumnRenamed("b", "b_timestamp")` but it's only viable when pivoted values are predefined. — ptkvsk, Feb 10 '20 at 15:07

Spark: use value of a groupBy column as a name for an aggregate column

1 Answers1