
I have a DataFrame that records the rank of each item for each user, where ranks can be tied. For user-item pairs with tied ranks, I want to break the ties randomly per user when selecting the top k entries.

e.g. In this example, k = 3.

input:

+-------+-------+----+
| user  | item  |rank|
+-------+-------+----+
|      1|      1|   1|
|      1|      2|   2|
|      2|      1|   1|
|      2|      3|   1|
|      2|      2|   2|
|      2|      4|   2|
|      3|      2|   1|
|      3|      4|   1|
|      3|      1|   2|
|      3|      3|   2|
+-------+-------+----+

One desired output is like:

+-------+-------+----+
| user  | item  |rank|
+-------+-------+----+
|      1|      1|   1|
|      1|      2|   2|
|      2|      1|   1|
|      2|      3|   1|
|      2|      2|   2|
|      3|      2|   1|
|      3|      4|   1|
|      3|      1|   2|
+-------+-------+----+

The output below is equally valid, as are the other two combinations (not listed here):

+-------+-------+----+
| user  | item  |rank|
+-------+-------+----+
|      1|      1|   1|
|      1|      2|   2|
|      2|      1|   1|
|      2|      3|   1|
|      2|      4|   2|
|      3|      2|   1|
|      3|      4|   1|
|      3|      1|   2|
+-------+-------+----+

I browsed all of `spark.sql.functions` but didn't find anything useful, and Google didn't turn up anything either.

Any help is appreciated!

    `row_number` is your friend in this situation – David Apr 18 '18 at 19:27
  • Use `row_number` as shown in the dupe link (`withColumn("rowNumber", row_number.over(Window.partitionBy("user").orderBy("rank")))`). Then select the rows where the row number is at most k, something like: `df.where(col("rowNumber") <= k).sort("user", "rowNumber").drop("rowNumber")` – pault Apr 18 '18 at 20:27
  • Here is an example of what I described in my previous comment: https://stackoverflow.com/a/49714042/5858851 – pault Apr 18 '18 at 20:30
