
I have a dataframe with these values:

#+-------+---------+-----+
#|name1  |name 2   |score|
#+-------+---------+-----+
#| abcdef| abcghi  |    3|
#| abcdef| abcjkl  |    3|
#| abcdef| abcyui  |    3|
#| abcdef| abrtyu  |    4|
#| pqrstu| pqrswe  |    2|
#| pqrstu| pqrsqw  |    2|
#| pqrstu| pqrzxc  |    3|
#+-------+---------+-----+

I need to group by name1 and pick the rows with the least score.

I understand that I can group by name1, sort by score in ascending order, and pick the first row in each group. I do this by:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

joined_windows = Window.partitionBy("name1").orderBy(col("score").asc())
result = joined_df.withColumn("rn", row_number().over(joined_windows)).where(col("rn") == 1).drop("rn")

But I want the dataframe to hold the following values (i.e., the set of rows with the least score in each group):

#+-------+---------+-----+
#|name1  |name 2   |score|
#+-------+---------+-----+
#| abcdef| abcghi  |    3|
#| abcdef| abcjkl  |    3|
#| abcdef| abcyui  |    3|
#| pqrstu| pqrswe  |    2|
#| pqrstu| pqrsqw  |    2|
#+-------+---------+-----+
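
(For anyone reproducing this: a minimal sketch of building the sample data, assuming the second column is named name2 and the DataFrame variable is joined_df, as in the snippet above.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the tables above
joined_df = spark.createDataFrame(
    [("abcdef", "abcghi", 3), ("abcdef", "abcjkl", 3), ("abcdef", "abcyui", 3),
     ("abcdef", "abrtyu", 4), ("pqrstu", "pqrswe", 2), ("pqrstu", "pqrsqw", 2),
     ("pqrstu", "pqrzxc", 3)],
    ["name1", "name2", "score"],
)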

2 Answers


To keep all the rows with the minimum score in each group, the following Scala code can be used:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min
import spark.implicits._ // enables the $"colName" syntax

val joined_windows = Window.partitionBy("name1")
val result = df.withColumn("rn", min($"score").over(joined_windows))
result.where($"rn" === $"score").drop("rn").show(false)

Output:

+------+------+-----+
|name1 |name 2|score|
+------+------+-----+
|abcdef|abcghi|3    |
|abcdef|abcjkl|3    |
|abcdef|abcyui|3    |
|pqrstu|pqrswe|2    |
|pqrstu|pqrsqw|2    |
+------+------+-----+
  • It says the column score is not iterable, on the second line of your code – Mohammed Ashiq Sep 19 '18 at 12:48
  • Scala code provided; $"score" means the same as col("score") – pasha701 Sep 19 '18 at 12:50
  • Sorry Pasha, you are absolutely right. The reason was that I was using the min function, which was shadowed by Python's built-in min. I could fix that error with `from pyspark.sql.functions import min as sparkMin` and using sparkMin instead of min – Mohammed Ashiq Sep 19 '18 at 12:53
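
Putting the comment thread together, a PySpark version of this answer might look like the following (a sketch; joined_df is the DataFrame from the question, and the sparkMin alias follows the fix above):

from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import min as sparkMin  # avoid shadowing Python's built-in min

# min over a partition-only window gives each row its group's minimum score
joined_windows = Window.partitionBy("name1")
result = joined_df.withColumn("rn", sparkMin(col("score")).over(joined_windows))
result.where(col("rn") == col("score")).drop("rn").show(truncate=False)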

You can group by two columns:

import pyspark.sql.functions as F

df \
    .groupBy('name1', 'name2') \
    .agg(F.min('score'))
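
Note that this returns one aggregated row per (name1, name2) pair rather than filtering the original rows. A variation (my sketch, not part of the original answer) that recovers the rows with the least score per name1 group is to compute the per-group minimum and join it back:

import pyspark.sql.functions as F

# Minimum score per name1 group, then keep only the rows that match it
mins = df.groupBy('name1').agg(F.min('score').alias('min_score'))
result = df.join(mins, on='name1') \
    .where(F.col('score') == F.col('min_score')) \
    .drop('min_score')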