
I have a DataFrame that records the rank of each item for each user, where ranks can be tied. For user-item pairs with tied ranks, I want to break the ties randomly per user when selecting the top k entries.

e.g. In this example, k = 3.

input:

+-------+-------+----+
| user  | item  |rank|
+-------+-------+----+
|      1|      1|   1|
|      1|      2|   2|
|      2|      1|   1|
|      2|      3|   1|
|      2|      2|   2|
|      2|      4|   2|
|      3|      2|   1|
|      3|      4|   1|
|      3|      1|   2|
|      3|      3|   2|
+-------+-------+----+

One desired output is like:

+-------+-------+----+
| user  | item  |rank|
+-------+-------+----+
|      1|      1|   1|
|      1|      2|   2|
|      2|      1|   1|
|      2|      3|   1|
|      2|      2|   2|
|      3|      2|   1|
|      3|      4|   1|
|      3|      1|   2|
+-------+-------+----+

The output below is equally valid, as are the other two combinations (not listed here):

+-------+-------+----+
| user  | item  |rank|
+-------+-------+----+
|      1|      1|   1|
|      1|      2|   2|
|      2|      1|   1|
|      2|      3|   1|
|      2|      4|   2|
|      3|      2|   1|
|      3|      4|   1|
|      3|      1|   2|
+-------+-------+----+

I browsed all of `spark.sql.functions` but didn't find anything useful, and Google didn't turn up anything either.

Any help is appreciated!

    `row_number` is your friend in this situation – David Apr 18 '18 at 19:27
  • Use `row_number` as shown in the dupe link (`withColumn("rowNumber", row_number.over(Window.partitionBy("user").orderBy("rank")))`). Then select the rows where the row number is at most k, something like: `df.where(col("rowNumber") <= k).sort("user", "rowNumber").drop("rowNumber")` – pault Apr 18 '18 at 20:27
  • Here is an example of what I described in my previous comment: https://stackoverflow.com/a/49714042/5858851 – pault Apr 18 '18 at 20:30
