
I have a DataFrame that looks like this:

+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
|    3|      3|  600|
|    2|      3|  702|
|    1|      2|  120|
|    2|      5|  200|
|    2|      2|  500|
|    3|      1|  100|
|    3|      5| 2000|
|    3|      4|  122|
|    1|      3| 1200|
|    1|      1| 1000|
+-----+-------+-----+

I want to output the maximum count for each vocabId, along with the docId it belongs to. I did this:

val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count"))

and got this:

+-------+----------+
|vocabId|    count |
+-------+----------+
|      1|      1000|
|      3|      1200|
|      5|      2000|
|      4|       122|
|      2|       500|
+-------+----------+

How do I add the docId at the front? It should look something like this (the order is not important):

+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
|    2|      2|  500|
|    3|      5| 2000|
|    3|      4|  122|
|    1|      3| 1200|
|    1|      1| 1000|
+-----+-------+-----+
    How do you decide which `docId` to add? – Dima Jun 02 '22 at 12:00
  • The `docId` should be the one that the `vocabId` with the `max count` belongs to. For example, the `max count` of `vocabId` 1 is 1000, and it belongs to `docId` 1. – MattNewbie Jun 02 '22 at 12:17
  • Check [this](https://stackoverflow.com/questions/48829993/groupby-column-and-filter-rows-with-maximum-value-in-pyspark) out. This is how this kind of thing is done. It uses PySpark, but translates to Scala almost 1:1. – Dima Jun 02 '22 at 12:28
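
A minimal Scala sketch of the window-function approach from the linked question, assuming the `docwords` DataFrame above is in scope; `row_number` keeps exactly one row per `vocabId`, so ties on `count` are broken arbitrarily:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank the rows within each vocabId by count, highest first.
val byCountDesc = Window.partitionBy("vocabId").orderBy(col("count").desc)

// Keep only the top-ranked row per vocabId, then drop the helper column.
val topDocPerVocab = docwords
  .withColumn("rn", row_number().over(byCountDesc))
  .filter(col("rn") === 1)
  .select("docId", "vocabId", "count")
```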

1 Answer


You can do a self join of the aggregated result with docwords on vocabId and count, something like below:

val wordCounts = docwords
  .groupBy("vocabId")
  .agg(max($"count") as ("count"))
  .join(docwords, Seq("vocabId", "count"))
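
For completeness, a runnable sketch with the sample data from the question (assuming a `SparkSession` in scope as `spark`):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val docwords = Seq(
  (3, 3, 600), (2, 3, 702), (1, 2, 120), (2, 5, 200), (2, 2, 500),
  (3, 1, 100), (3, 5, 2000), (3, 4, 122), (1, 3, 1200), (1, 1, 1000)
).toDF("docId", "vocabId", "count")

// Per-vocabId maxima, joined back to docwords to recover the matching docId.
val maxima = docwords.groupBy("vocabId").agg(max($"count").as("count"))
val wordCounts = maxima.join(docwords, Seq("vocabId", "count"))

wordCounts.select("docId", "vocabId", "count").show()
```

Note that if two documents tie on the maximum count for a vocabId, the join returns a row for each of them; deduplicate afterwards if you need exactly one row per vocabId.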
Anjaneya Tripathi