2

I have the below df:

+------+-------+--------+
|student|  vars|observed|
+------+-------+--------+
|  1|   ABC   |      19|
|  1|    ABC   |       1|
|  2|    CDB   |       1|
|  1|    ABC   |       8|
|   3|   XYZ   |       3|
|  1|    ABC   |     389|
|   2|   CDB   |     946|
|  1|    ABC   |     342|
|+------+-------+--------+

I wanted to add a new frequency column groupBy two columns "student", "vars" in SCALA.

val frequency = df.groupBy($"student", $"vars").count()

This code generates a "count" column with the frequencies BUT losing observed column from the df.

I would like to create a new df as follows without losing "observed" column

+------+-------+--------+------------+
|student|  vars|observed|total_count |
+------+-------+--------+------------+
|  1|   ABC   |        9|22
|  1|    ABC   |       1|22
|  2|    CDB   |       1|7
|  1|    ABC   |       2|22
|   3|   XYZ   |       3|3
|  1|    ABC   |       8|22
|   2|   CDB   |       6|7
|  1|    ABC   |       2|22
|+------+-------+-------+--------------+
jwvh
  • 50,871
  • 7
  • 38
  • 64

2 Answers2

2

You cannot do this directly but there are couple of ways,

  1. You can join original df with count df. check here
  2. You collect the observed column while doing aggregation and explode it again

With explode:

 val frequency = df.groupBy("student", "vars").agg(collect_list("observed").as("observed_list"),count("*").as("total_count")).select($"student", $"vars",explode($"observed_list").alias("observed"), $"total_count")

scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3      |XYZ |3       |1          |
|2      |CDB |1       |2          |
|2      |CDB |946     |2          |
|1      |ABC |389     |5          |
|1      |ABC |342     |5          |
|1      |ABC |19      |5          |
|1      |ABC |1       |5          |
|1      |ABC |8       |5          |
+-------+----+--------+-----------+
Sathiyan S
  • 1,013
  • 6
  • 13
2

We can use Window functions as well

val windowSpec = Window.partitionBy("student","vars")
val frequency  = df.withColumn("total_count", count(col("student")) over windowSpec)
      .show


+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3      |XYZ |3       |1          |
|2      |CDB |1       |2          |
|2      |CDB |946     |2          |
|1      |ABC |389     |5          |
|1      |ABC |342     |5          |
|1      |ABC |19      |5          |
|1      |ABC |1       |5          |
|1      |ABC |8       |5          |
+-------+----+--------+-----------+
Anup Thomas
  • 198
  • 6