How to rename the columns produced by the count() function in Scala

Question

I have the below df:

+-------+----+--------+
|student|vars|observed|
+-------+----+--------+
|1      |ABC |19      |
|1      |ABC |1       |
|2      |CDB |1       |
|1      |ABC |8       |
|3      |XYZ |3       |
|1      |ABC |389     |
|2      |CDB |946     |
|1      |ABC |342     |
+-------+----+--------+

I want to add a new frequency column, grouping by the two columns "student" and "vars", in Scala.

val frequency = df.groupBy($"student", $"vars").count()

This code generates a "count" column with the frequencies, but it drops the "observed" column from the df.

I would like to create a new df as follows, without losing the "observed" column:

+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|1      |ABC |9       |22         |
|1      |ABC |1       |22         |
|2      |CDB |1       |7          |
|1      |ABC |2       |22         |
|3      |XYZ |3       |3          |
|1      |ABC |8       |22         |
|2      |CDB |6       |7          |
|1      |ABC |2       |22         |
+-------+----+--------+-----------+

How do you want to do that if you're not grouping by `observed`? — Andronicus, Feb 25 '20 at 05:37

score 2 · Accepted Answer · answered Feb 25 '20 at 07:04

You cannot do this directly, but there are a couple of ways:

1. Join the original df with the count df.
2. Collect the observed column during the aggregation and explode it again.
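
For the join option, a minimal sketch might look like the following. The object name JoinCountSketch, the helper withTotalCount and the local SparkSession are assumptions made for illustration; only the column names and the sample rows come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinCountSketch {
  // Hypothetical helper: appends a per-(student, vars) row count to every row.
  def withTotalCount(df: DataFrame): DataFrame = {
    // count() names its output column "count", so rename it here.
    val counts = df.groupBy("student", "vars")
      .count()
      .withColumnRenamed("count", "total_count")
    // Inner join on the grouping keys; each original row keeps "observed".
    df.join(counts, Seq("student", "vars"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question's df.
    val df = Seq(
      (1, "ABC", 19), (1, "ABC", 1), (2, "CDB", 1), (1, "ABC", 8),
      (3, "XYZ", 3), (1, "ABC", 389), (2, "CDB", 946), (1, "ABC", 342)
    ).toDF("student", "vars", "observed")

    withTotalCount(df).show(false)
    spark.stop()
  }
}
```

The row order after the join is not guaranteed, so add an orderBy if the order matters.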

With explode:

 import org.apache.spark.sql.functions.{collect_list, count, explode}

 val frequency = df.groupBy("student", "vars")
   .agg(collect_list("observed").as("observed_list"), count("*").as("total_count"))
   .select($"student", $"vars", explode($"observed_list").alias("observed"), $"total_count")

scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3      |XYZ |3       |1          |
|2      |CDB |1       |2          |
|2      |CDB |946     |2          |
|1      |ABC |389     |5          |
|1      |ABC |342     |5          |
|1      |ABC |19      |5          |
|1      |ABC |1       |5          |
|1      |ABC |8       |5          |
+-------+----+--------+-----------+

score 2 · Answer 2 · answered Feb 26 '20 at 05:06

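We can use window functions as well:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val windowSpec = Window.partitionBy("student", "vars")
val frequency  = df.withColumn("total_count", count(col("student")).over(windowSpec))
frequency.show(false)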


+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3      |XYZ |3       |1          |
|2      |CDB |1       |2          |
|2      |CDB |946     |2          |
|1      |ABC |389     |5          |
|1      |ABC |342     |5          |
|1      |ABC |19      |5          |
|1      |ABC |1       |5          |
|1      |ABC |8       |5          |
+-------+----+--------+-----------+
