I have a df like the one below, and I need to select rows based on certain preferences within each prod_name. For prod_name = A, randomly select 3 rows, and the same for B and C, based on a preferred colour order: blue, green, yellow, red. For example, if there are blue rows, they would be selected first, followed by green, etc. The problem is that for prod_name = A there are too many rows with blue. I need to put a limit in place, so that for product A at most 2 blue rows are randomly selected (if there are any), and the remaining rows are then selected from the other colours in the preferred order.

I can sort of understand how to randomly select rows within a window based on the last answer to "Choosing random items from a Spark GroupedData Object", but I'm really not sure how to put a limit on a certain colour and how to put it all together. Could someone please help? Many thanks in advance.
prod_name | colour | prod_id
-------------------------------
A         | blue   | 100
A         | blue   | 200
A         | blue   | 300
A         | blue   | 300
A         | yellow | 309
B         | green  | 408
B         | blue   | 50
C         | red    | 6000
C         | blue   | 70
C         | green  | 10
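
To make the intended rule concrete, here is a minimal plain-Python sketch of the selection logic described above. It is not Spark code and not a definitive implementation: the function name `select_rows`, the `caps` mapping, the `n_per_product` parameter, and the seed are all assumptions introduced for illustration.

```python
import random
from collections import defaultdict

# Preferred colour order from the question, highest priority first.
COLOUR_ORDER = ["blue", "green", "yellow", "red"]


def select_rows(rows, n_per_product=3, caps=None, seed=None):
    """Pick up to n_per_product rows per prod_name by colour priority.

    rows: list of (prod_name, colour, prod_id) tuples.
    caps: hypothetical mapping {(prod_name, colour): max_rows},
          e.g. {("A", "blue"): 2} to allow at most 2 blue rows for A.
    Colours not listed in COLOUR_ORDER are never selected.
    """
    rng = random.Random(seed)
    caps = caps or {}

    # Group the rows by (prod_name, colour).
    by_group = defaultdict(list)
    for row in rows:
        by_group[(row[0], row[1])].append(row)

    selected = []
    for prod in sorted({row[0] for row in rows}):
        remaining = n_per_product
        for colour in COLOUR_ORDER:
            if remaining == 0:
                break
            candidates = by_group.get((prod, colour), [])
            # Take no more than: slots left, the cap (if any), rows available.
            limit = min(remaining,
                        caps.get((prod, colour), remaining),
                        len(candidates))
            selected.extend(rng.sample(candidates, limit))
            remaining -= limit
    return selected


# Sample data copied from the table above.
rows = [
    ("A", "blue", 100), ("A", "blue", 200), ("A", "blue", 300),
    ("A", "blue", 300), ("A", "yellow", 309),
    ("B", "green", 408), ("B", "blue", 50),
    ("C", "red", 6000), ("C", "blue", 70), ("C", "green", 10),
]

picked = select_rows(rows, caps={("A", "blue"): 2}, seed=0)
for row in picked:
    print(row)
```

With this data, A ends up with two randomly chosen blue rows plus the yellow row, B only has two rows to give, and C gets its blue, green, and red rows in that order. In Spark, the same idea would presumably map to ranking rows in random order within each (prod_name, colour) group, dropping rows above the cap, and then ranking again within each prod_name by colour priority, but that translation is not shown here.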