
I have a DataFrame that looks like this:

+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
|    3|      3|  600|
|    2|      3|  702|
|    1|      2|  120|
|    2|      5|  200|
|    2|      2|  500|
|    3|      1|  100|
|    3|      5| 2000|
|    3|      4|  122|
|    1|      3| 1200|
|    1|      1| 1000|
+-----+-------+-----+

I want to output the maximum count for each vocabId, along with the docId it belongs to. I did this:

val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count"))

and got this:

+-------+----------+
|vocabId|    count |
+-------+----------+
|      1|      1000|
|      3|      1200|
|      5|      2000|
|      4|       122|
|      2|       500|
+-------+----------+

How do I add the docId at the front? It should look something like this (the order is not important):

+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
|    2|      2|  500|
|    3|      5| 2000|
|    3|      4|  122|
|    1|      3| 1200|
|    1|      1| 1000|
+-----+-------+-----+
    How do you decide which `docId` to add? – Dima Jun 02 '22 at 12:00
  • The `docId` should be the one that the `vocabId` with the `max count` belongs to. For example, the `max count` of `vocabId` 1 is 1000, and it belongs to `docId` 1. – MattNewbie Jun 02 '22 at 12:17
  • Check [this](https://stackoverflow.com/questions/48829993/groupby-column-and-filter-rows-with-maximum-value-in-pyspark) out. This is how this kind of thing is done. It uses PySpark, but translates to Scala almost 1:1. – Dima Jun 02 '22 at 12:28
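
A minimal Scala sketch of the window-function approach from the linked question, assuming the `docwords` DataFrame above is in scope; `row_number` keeps exactly one row per `vocabId`, so ties on `count` are broken arbitrarily:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank the rows within each vocabId by count, highest first.
val byCountDesc = Window.partitionBy("vocabId").orderBy(col("count").desc)

// Keep only the top-ranked row per vocabId, then drop the helper column.
val topDocPerVocab = docwords
  .withColumn("rn", row_number().over(byCountDesc))
  .filter(col("rn") === 1)
  .select("docId", "vocabId", "count")
```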

1 Answer


You can do a self join of the aggregated result with docwords on vocabId and count, something like below:

val wordCounts = docwords
  .groupBy("vocabId")
  .agg(max($"count") as ("count"))
  .join(docwords, Seq("vocabId", "count"))
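
For completeness, a runnable sketch with the sample data from the question (assuming a `SparkSession` in scope as `spark`):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val docwords = Seq(
  (3, 3, 600), (2, 3, 702), (1, 2, 120), (2, 5, 200), (2, 2, 500),
  (3, 1, 100), (3, 5, 2000), (3, 4, 122), (1, 3, 1200), (1, 1, 1000)
).toDF("docId", "vocabId", "count")

// Per-vocabId maxima, joined back to docwords to recover the matching docId.
val maxima = docwords.groupBy("vocabId").agg(max($"count").as("count"))
val wordCounts = maxima.join(docwords, Seq("vocabId", "count"))

wordCounts.select("docId", "vocabId", "count").show()
```

Note that if two documents tie on the maximum count for a vocabId, the join returns a row for each of them; deduplicate afterwards if you need exactly one row per vocabId.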
Anjaneya Tripathi