Suppose I have a Dataset that looks like this:
+--------------------+---------+------+--------------------+
|             transID|principal|subSeq|          subTransID|
+--------------------+---------+------+--------------------+
|2116e07b-14ea-476...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     7|2116e07b-14ea-476...|
+--------------------+---------+------+--------------------+
I want to remove duplicate rows by grouping on the column transID and keeping, for each transID, the row with the maximum value of the column subSeq. However, I want the resultant Dataset to show not the max(subSeq) column, but the column subTransID from the original Dataset.
If I do this:
dsJoin.groupBy("transID").agg(functions.max("subSeq")).show();
Then I get
+--------------------+-----------+
|             transID|max(subSeq)|
+--------------------+-----------+
|3859a175-f16b-4fd...|          7|
|2116e07b-14ea-476...|          4|
+--------------------+-----------+
The duplicate row 3859a175-f16b-4fd... with value 4 in column subSeq has been correctly removed in favor of the row with the max value 7. But I want the column subTransID from that surviving row to be shown in the resultant Dataset!
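To make the goal concrete, here is a minimal plain-Java sketch of the logic I am after. It does not use Spark at all: the class MaxPerGroup, the nested Row type, and the method subTransIdOfMaxSubSeq are hypothetical names made up for this illustration, and the IDs are the truncated strings exactly as show() printed them.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MaxPerGroup {

    /** One row of the example table (hypothetical type, not a Spark Row). */
    static final class Row {
        final String transID;
        final String principal;
        final int subSeq;
        final String subTransID;

        Row(String transID, String principal, int subSeq, String subTransID) {
            this.transID = transID;
            this.principal = principal;
            this.subSeq = subSeq;
            this.subTransID = subTransID;
        }
    }

    /** For each transID, keeps the row with the largest subSeq and returns its subTransID. */
    static Map<String, String> subTransIdOfMaxSubSeq(List<Row> rows) {
        // LinkedHashMap keeps the transIDs in order of first appearance.
        Map<String, Row> best = new LinkedHashMap<>();
        for (Row r : rows) {
            // Keep the stored row unless the new one has a strictly larger subSeq.
            best.merge(r.transID, r,
                    (oldRow, newRow) -> newRow.subSeq > oldRow.subSeq ? newRow : oldRow);
        }
        // Project each winning row down to its subTransID.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Row> e : best.entrySet()) {
            result.put(e.getKey(), e.getValue().subTransID);
        }
        return result;
    }

    public static void main(String[] args) {
        // The three rows from the Dataset above, with IDs as truncated by show().
        List<Row> rows = Arrays.asList(
                new Row("2116e07b-14ea-476...", "bob", 4, "ec463751-22ca-477..."),
                new Row("3859a175-f16b-4fd...", "bob", 4, "ec463751-22ca-477..."),
                new Row("3859a175-f16b-4fd...", "bob", 7, "2116e07b-14ea-476..."));
        subTransIdOfMaxSubSeq(rows)
                .forEach((transID, subTransID) -> System.out.println(transID + " -> " + subTransID));
    }
}
```

So for transID 3859a175-f16b-4fd... the result should carry subTransID 2116e07b-14ea-476... (taken from the subSeq 7 row), and for transID 2116e07b-14ea-476... it should carry ec463751-22ca-477.... What I cannot work out is how to express this same "value from the row holding the max" step with the Dataset API.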
I must be missing something very obvious here.
I am doing this in Java. Thanks for any suggestions!