Include other columns when doing groupBy and agg in SparkSQL

Question

Suppose I have a Dataset that looks like this:

+--------------------+---------+------+--------------------+
|             transID|principal|subSeq|          subTransID|
+--------------------+---------+------+--------------------+
|2116e07b-14ea-476...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     7|2116e07b-14ea-476...|
+--------------------+---------+------+--------------------+

I want to remove duplicate rows by grouping on the column transID and keeping, for each group, the row with the maximum value of the column subSeq. However, I want the resultant Dataset to show not the max(subSeq) column, but the column subTransID from the original Dataset.

If I do this:

dsJoin.groupBy("transID").agg(functions.max("subSeq")).show();

Then I get

+--------------------+-----------+
|             transID|max(subSeq)|
+--------------------+-----------+
|3859a175-f16b-4fd...|          7|
|2116e07b-14ea-476...|          4|
+--------------------+-----------+

The duplicate row 3859a175-f16b-4fd... with value 4 in column subSeq has been correctly removed based on the max value 7 in another row. But I want to have the column subTransID shown in the resultant Dataset!

I must be missing something very obvious here.

I'm doing this in Java. Thanks for any suggestions!

score 3 · Answer 1 · answered Aug 17 '18 at 13:38

You should pack the relevant attributes into a struct, apply the aggregate function, and then unpack the struct again (Scala code below):

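import org.apache.spark.sql.functions.{max, struct}

dsJoin.groupBy("transID")
  // max over a struct compares subSeq first, so subTransID comes from the winning row
  .agg(max(struct("subSeq", "subTransID")).as("max"))
  // expand the struct back into top-level columns
  .select("transID", "max.*")
  .show()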

answered Aug 17 '18 at 13:38

Raphael Roth


So Spark is finding the `max` only on the first element of the `struct` while the other elements in the `struct` play no role for the max? – Björn Jacobs Jan 27 '22 at 16:31

@BjörnJacobs If the first elements are equal, then the second element of the struct also plays a role – Raphael Roth Jan 28 '22 at 10:16
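
To make the comparison rule from these comments concrete, here is a minimal plain-Java sketch (no Spark involved) of the same idea: order rows by subSeq first and fall back to subTransID only on ties, then keep the largest row per transID. The class `StructMaxSketch`, the `Row` type and the method `maxPerTransID` are hypothetical names made up for this illustration; the sample values are the truncated ones from the question.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

class StructMaxSketch {

    // Hypothetical stand-in for one row of dsJoin (column names from the question).
    static final class Row {
        final String transID;
        final int subSeq;
        final String subTransID;

        Row(String transID, int subSeq, String subTransID) {
            this.transID = transID;
            this.subSeq = subSeq;
            this.subTransID = subTransID;
        }
    }

    // Field-by-field ordering of the pair (subSeq, subTransID):
    // subSeq decides, subTransID is consulted only when subSeq is equal.
    static final Comparator<Row> STRUCT_ORDER =
            Comparator.comparingInt((Row r) -> r.subSeq)
                      .thenComparing(r -> r.subTransID);

    // For each transID, keep the row that is largest under STRUCT_ORDER.
    // A TreeMap is used only to make the iteration order deterministic.
    static Map<String, Row> maxPerTransID(List<Row> rows) {
        return rows.stream().collect(Collectors.toMap(
                r -> r.transID,
                r -> r,
                BinaryOperator.maxBy(STRUCT_ORDER),
                TreeMap::new));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("2116e07b-14ea-476...", 4, "ec463751-22ca-477..."),
                new Row("3859a175-f16b-4fd...", 4, "ec463751-22ca-477..."),
                new Row("3859a175-f16b-4fd...", 7, "2116e07b-14ea-476..."));

        // Prints:
        // 2116e07b-14ea-476... 4 ec463751-22ca-477...
        // 3859a175-f16b-4fd... 7 2116e07b-14ea-476...
        maxPerTransID(rows).forEach((id, r) ->
                System.out.println(id + " " + r.subSeq + " " + r.subTransID));
    }
}
```

Since the question asks for Java, the answer's expression would presumably translate to `dsJoin.groupBy("transID").agg(functions.max(functions.struct("subSeq", "subTransID")).as("max")).select("transID", "max.*")` in Spark's Java API; that translation has not been run here.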

score 1 · Accepted Answer · answered Aug 17 '18 at 13:00

In the agg expression, also take first() of the other fields:

dsJoin.groupBy("transID").agg(functions.max("subSeq"),functions.first("principal")).show();

answered Aug 17 '18 at 13:00

Arnon Rotem-Gal-Oz


Thank you, that did it -- except it looks like what I need here is not .first() but .last()? Anyway, this way it seems to do what I need it to do. – VS_FF Aug 17 '18 at 13:06
If "other" values are the same it doesn't matter which copy you take – Arnon Rotem-Gal-Oz Aug 17 '18 at 13:17
This only works if all `subTransID` values per `transID` are the same; I'm not sure if that's the case here – Raphael Roth Aug 17 '18 at 13:40
