
I have data like so:

+----------+----------+--------+
| Location | Product  | Amount |
+----------+----------+--------+
| London   | Fish     |    307 |
| London   | Chips    |    291 |
| London   | Beer     |    147 |
| Paris    | Baguettes|    217 |
| Paris    | Cheese   |    103 |
| Paris    | Champagne|     74 |
+----------+----------+--------+

Naturally there are many locations and many products per location. I want to end up at a dataframe like this:

+----------+---------------------+-------------------------+-------+-------------------------+
| Location | Most Common Product | 2nd Most Common Product |.....  | Nth Most Common Product |
+----------+---------------------+-------------------------+-------+-------------------------+
| London   | Fish                | Chips                   | ....  |     something           |
| Paris    | Baguettes           | Cheese                  | ....  |     something else      |
+----------+---------------------+-------------------------+-------+-------------------------+

I have figured out how to get the single most common product per location.
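
A minimal sketch of one way to do that, using a window ordered by Amount descending (the dataframe name df_data is assumed):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# df_data holds the Location / Product / Amount rows shown above (name assumed)
w = Window.partitionBy("Location").orderBy(f.col("Amount").desc())

(df_data
    .withColumn("rank", f.row_number().over(w))
    .filter(f.col("rank") == 1)               # keep only the top row per location
    .select("Location", f.col("Product").alias("Most Common Product"))
    .show())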

In extending it to the N most common, I could create another dataframe with those rows removed, run the process again to get the 2nd most common, and join the results back together by Location. With appropriate column naming, this could be put into a loop that runs N times, adding a column each iteration.

However, this is going to be very slow since it would partition and join every iteration. How could I get, for example, the 50 most common per location, in a better way?

J.Doe

1 Answer


You can use pivot.

First, create a row number within each location (ordered by Amount, descending) and then pivot on it:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Rank each product within its location by Amount, highest first
w = Window.partitionBy("Location").orderBy(f.col("Amount").desc())
df_ranked = df_data.withColumn("row_number", f.row_number().over(w))

# Pivot the rank into columns, taking the product name at each rank
(df_ranked
    .groupby("Location")
    .pivot("row_number")
    .agg(f.first("Product"))
    .show())
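
To cap this at the N most common (e.g. the 50 from the question), filter on the row number before pivoting; passing the expected values to pivot also saves Spark a pass to discover them. A sketch building on the code above (the value of N and the column names are just illustrative):

N = 50  # how many "most common" columns to keep (illustrative)

result = (df_ranked
    .filter(f.col("row_number") <= N)
    .groupby("Location")
    .pivot("row_number", list(range(1, N + 1)))
    .agg(f.first("Product")))

# Optionally rename the numeric pivot columns to something more readable
for i in range(1, N + 1):
    result = result.withColumnRenamed(str(i), "Product #{}".format(i))

result.show()

Locations with fewer than N products simply end up with nulls in the higher-numbered columns.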
Fahmi