1

Suppose I ave a Dataset that looks like this:

+--------------------+---------+------+--------------------+
|             transID|principal|subSeq|          subTransID|
+--------------------+---------+------+--------------------+
|2116e07b-14ea-476...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     7|2116e07b-14ea-476...|
+--------------------+---------+------+--------------------+

I want to remove duplicate rows by aggregating the column transID based on the maximum value of the column subSeq, but I want to resultant Dataset to show not the max(subSeq) column, but instead the column subTransID from the original Dataset.

If I do this:

dsJoin.groupBy("transID").agg(functions.max("subSeq")).show();

Then I get

+--------------------+-----------+
|             transID|max(subSeq)|
+--------------------+-----------+
|3859a175-f16b-4fd...|          7|
|2116e07b-14ea-476...|          4|
+--------------------+-----------+

The duplicate row 3859a175-f16b-4fd... with value 4 in column subSeq has been correctly removed based on the max value 7 in another row. But I want to have the column subTransID shown in the resultant Dataset!

I must be missing something very obvious here.

Doing this in JAVA. Thanks for any suggestions!

VS_FF
  • 2,353
  • 3
  • 16
  • 34

2 Answers2

3

You should pack the relevant attributes into a struct, apply the aggregate-function and then unpack the struct again ((scala-code below) :

dsJoin.groupBy("transID")
.agg(
     max(struct("subSeq","subTransID")).as("max")
)
.select("transID","max.*")
.show()
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145
  • So Spark is finding the `max` only on the first element of the `struct` while the other elements in the `struct` play no role for the max? – Björn Jacobs Jan 27 '22 at 16:31
  • 2
    @BjörnJacobs If the first element ist equal, then the second element of the struct will also play a role – Raphael Roth Jan 28 '22 at 10:16
1

in the agg expression also get first from the others fields

dsJoin.groupBy("transID").agg(functions.max("subSeq"),functions.first("principal")).show();
Arnon Rotem-Gal-Oz
  • 25,469
  • 3
  • 45
  • 68