
I have a dataframe with these values:

#+-------+---------+-----+
#|name1  |name 2   |score|
#+-------+---------+-----+
#| abcdef| abcghi  |    3|
#| abcdef| abcjkl  |    3|
#| abcdef| abcyui  |    3|
#| abcdef| abrtyu  |    4|
#| pqrstu| pqrswe  |    2|
#| pqrstu| pqrsqw  |    2|
#| pqrstu| pqrzxc  |    3|
#+-------+---------+-----+

I need to group by name1 and pick the rows with the least score.

I understand that I can group by name1, sort by score in ascending order, and pick the first row in each group. I do this by:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

joined_windows = Window.partitionBy("name1").orderBy(col("score").asc())
result = joined_df.withColumn("rn", row_number().over(joined_windows)).where(col("rn") == 1).drop("rn")

But I want the dataframe to hold the following values (i.e., the set of rows with the least score in each group):

#+-------+---------+-----+
#|name1  |name 2   |score|
#+-------+---------+-----+
#| abcdef| abcghi  |    3|
#| abcdef| abcjkl  |    3|
#| abcdef| abcyui  |    3|
#| pqrstu| pqrswe  |    2|
#| pqrstu| pqrsqw  |    2|
#+-------+---------+-----+
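
(For anyone reproducing this: a minimal sketch of building the sample data, assuming the second column is named name2 and the DataFrame variable is joined_df, as in the snippet above.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the tables above
joined_df = spark.createDataFrame(
    [("abcdef", "abcghi", 3), ("abcdef", "abcjkl", 3), ("abcdef", "abcyui", 3),
     ("abcdef", "abrtyu", 4), ("pqrstu", "pqrswe", 2), ("pqrstu", "pqrsqw", 2),
     ("pqrstu", "pqrzxc", 3)],
    ["name1", "name2", "score"],
)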

2 Answers


To keep all the rows with the minimum score in each group, the following Scala code can be used:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min
import spark.implicits._ // enables the $"colName" syntax

val joined_windows = Window.partitionBy("name1")
val result = df.withColumn("rn", min($"score").over(joined_windows))
result.where($"rn" === $"score").drop("rn").show(false)

Output:

+------+------+-----+
|name1 |name 2|score|
+------+------+-----+
|abcdef|abcghi|3    |
|abcdef|abcjkl|3    |
|abcdef|abcyui|3    |
|pqrstu|pqrswe|2    |
|pqrstu|pqrsqw|2    |
+------+------+-----+
  • It says the column score is not iterable, on the second line of your code – Mohammed Ashiq Sep 19 '18 at 12:48
  • Scala code provided; $"score" means the same as col("score") – pasha701 Sep 19 '18 at 12:50
  • Sorry Pasha, you are absolutely right. The reason was that I was using the min function, which was shadowed by Python's built-in min. I could fix that error with `from pyspark.sql.functions import min as sparkMin` and using sparkMin instead of min – Mohammed Ashiq Sep 19 '18 at 12:53
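
Putting the comment thread together, a PySpark version of this answer might look like the following (a sketch; joined_df is the DataFrame from the question, and the sparkMin alias follows the fix above):

from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import min as sparkMin  # avoid shadowing Python's built-in min

# min over a partition-only window gives each row its group's minimum score
joined_windows = Window.partitionBy("name1")
result = joined_df.withColumn("rn", sparkMin(col("score")).over(joined_windows))
result.where(col("rn") == col("score")).drop("rn").show(truncate=False)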

You can group by two columns:

import pyspark.sql.functions as F

df \
    .groupBy('name1', 'name2') \
    .agg(F.min('score'))
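
Note that this returns one aggregated row per (name1, name2) pair rather than filtering the original rows. A variation (my sketch, not part of the original answer) that recovers the rows with the least score per name1 group is to compute the per-group minimum and join it back:

import pyspark.sql.functions as F

# Minimum score per name1 group, then keep only the rows that match it
mins = df.groupBy('name1').agg(F.min('score').alias('min_score'))
result = df.join(mins, on='name1') \
    .where(F.col('score') == F.col('min_score')) \
    .drop('min_score')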