
For simplicity, let's assume I have the following dataframe:

col X col Y col Z
A     1     5
A     2     10
A     3     10
B     5     15

I want to group by column X and aggregate by taking the min value of Z; however, I want the Y value to be the one adjacent to the min value of Z, something like

df.groupBy("X").agg(min("Z"), take_y_according_to_min_z("Y"))

Desired output:

col X col Y col Z
A     1     5
B     5     15

Note: If more than one row shares the min("Z") value, I don't care which of those rows we take.

I tried to find something online that is clean and Spark-like. It's really clear to me how I could do it in MapReduce, but I can't find a way to do it in Spark.

I'm working with Spark 1.6.
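Since the MapReduce formulation is already clear, here is that same per-key reduce sketched with plain Scala collections (the sample data is copied from the table above); an RDD `reduceByKey` with the same "keep the smaller-Z row" combiner would compute the same thing:

```scala
// Plain-Scala sketch of the per-key reduce: rows are (X, Y, Z) triples;
// group by X and keep, per group, the row whose Z is minimal.
val rows = Seq(("A", 1, 5), ("A", 2, 10), ("A", 3, 10), ("B", 5, 15))

val result = rows
  .groupBy { case (x, _, _) => x }         // key by column X
  .values
  .map(_.minBy { case (_, _, z) => z })    // per key, the row with smallest Z
  .toList
  .sortBy { case (x, _, _) => x }
// result == List(("A", 1, 5), ("B", 5, 15))
```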

koiralo
RefiPeretz
2 Answers


You can simply do

import org.apache.spark.sql.functions._

// put "Col Z" first in the struct so that min is decided by Z
df.select(col("Col X"), struct("Col Z", "Col Y").as("struct"))
  .groupBy("Col X").agg(min(col("struct")).as("min"))
  .select(col("Col X"), col("min.*"))

and you shall get what you desire

+-----+-----+-----+
|Col X|Col Z|Col Y|
+-----+-----+-----+
|B    |15   |5    |
|A    |5    |1    |
+-----+-----+-----+
Ramesh Maharjan

You can use a struct with the columns Z and Y as

df.groupBy("X").agg(min(struct("Z", "Y")).as("min"))
  .select("X", "min.*")

Output:

+---+---+---+
|X  |Z  |Y  |
+---+---+---+
|B  |15 |5  |
|A  |5  |1  |
+---+---+---+

Hope this helps!

koiralo