
For simplicity, let's assume I have the following dataframe:

col X col Y col Z
A     1     5
A     2     10
A     3     10
B     5     15

I want to group by column X and aggregate by taking the min value of Z; however, I want the Y value to be the one adjacent to the min value of Z, something like

df.groupBy("X").agg(min("Z"), take_y_according_to_min_z("Y"))

Desired output:

col X col Y col Z
A     1     5
B     5     15

Note: If more than one row shares the min("Z") value, I don't care which of those rows we take.

I tried to find something online that is clean and Spark-like. It's really clear to me how I could do it in MapReduce, but I can't find a way to do it in Spark.

I'm working with Spark 1.6.
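Since the MapReduce formulation is already clear, here is that same per-key reduce sketched with plain Scala collections (the sample data is copied from the table above); an RDD `reduceByKey` with the same "keep the smaller-Z row" combiner would compute the same thing:

```scala
// Plain-Scala sketch of the per-key reduce: rows are (X, Y, Z) triples;
// group by X and keep, per group, the row whose Z is minimal.
val rows = Seq(("A", 1, 5), ("A", 2, 10), ("A", 3, 10), ("B", 5, 15))

val result = rows
  .groupBy { case (x, _, _) => x }         // key by column X
  .values
  .map(_.minBy { case (_, _, z) => z })    // per key, the row with smallest Z
  .toList
  .sortBy { case (x, _, _) => x }
// result == List(("A", 1, 5), ("B", 5, 15))
```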

koiralo
RefiPeretz
2 Answers


You can simply do

import org.apache.spark.sql.functions._

// put "Col Z" first in the struct so that min is decided by Z
df.select(col("Col X"), struct("Col Z", "Col Y").as("struct"))
  .groupBy("Col X").agg(min(col("struct")).as("min"))
  .select(col("Col X"), col("min.*"))

and you shall get what you desire

+-----+-----+-----+
|Col X|Col Z|Col Y|
+-----+-----+-----+
|B    |15   |5    |
|A    |5    |1    |
+-----+-----+-----+
Ramesh Maharjan

You can use a struct with the columns Z and Y as

df.groupBy("X").agg(min(struct("Z", "Y")).as("min"))
  .select("X", "min.*")

Output:

+---+---+---+
|X  |Z  |Y  |
+---+---+---+
|B  |15 |5  |
|A  |5  |1  |
+---+---+---+

Hope this helps!

koiralo