
I have a DataFrame like the one below, and I need to select rows within each prod_name according to certain preferences. For each prod_name (A, B and C) I want to randomly select 3 rows, based on a preferred colour order: blue, green, yellow, red. For example, if there are blue rows, those are selected first, followed by green, and so on. The problem is that for prod_name = A there are too many blue rows. I need a limit: for product A, randomly select at most 2 blue rows, if there are any, and then select the other colours in the preferred order.

I more or less understand how to randomly select rows within a window, based on the last answer to "Choosing random items from a Spark GroupedData Object", but I'm not sure how to limit a particular colour or how to put it all together. Could someone please help? Many thanks in advance.

prod_name | colour | prod_id
-----------------------------
A         | blue   | 100
A         | blue   | 200
A         | blue   | 300
A         | blue   | 300
A         | yellow | 309
B         | green  | 408
B         | blue   | 50
C         | red    | 6000
C         | blue   | 70
C         | green  | 10
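
To make the rule concrete, here is a minimal sketch of the selection logic in plain Python (not Spark). The names `select_rows`, `caps` and `PREFERENCE` are made up for illustration, and it assumes the cap is applied per (prod_name, colour) pair before the overall limit of 3 rows per product.

```python
import random
from collections import defaultdict

# Preferred colour order from the question; earlier = higher priority.
PREFERENCE = ["blue", "green", "yellow", "red"]

def select_rows(rows, n=3, caps=None, seed=None):
    """Pick up to n rows per prod_name, by colour preference.

    rows: iterable of (prod_name, colour, prod_id) tuples.
    caps: hypothetical dict {(prod_name, colour): max_rows} limiting how
          many rows of one colour a product may contribute.
    """
    caps = caps or {}
    rng = random.Random(seed)

    # Pass 1: shuffle within each (prod_name, colour) group, then apply the cap.
    by_group = defaultdict(list)
    for row in rows:
        by_group[(row[0], row[1])].append(row)

    capped = defaultdict(list)
    for (prod, colour), group in by_group.items():
        rng.shuffle(group)
        limit = caps.get((prod, colour), len(group))
        capped[prod].extend(group[:limit])

    # Pass 2: rank by colour priority and keep the first n rows per product.
    def rank(row):
        colour = row[1]
        # Colours outside the preference list go last.
        return PREFERENCE.index(colour) if colour in PREFERENCE else len(PREFERENCE)

    result = []
    for prod in sorted(capped):
        # sorted() is stable, so the random order within a colour is kept.
        result.extend(sorted(capped[prod], key=rank)[:n])
    return result

# Sample data copied from the table above.
rows = [
    ("A", "blue", 100), ("A", "blue", 200), ("A", "blue", 300),
    ("A", "blue", 300), ("A", "yellow", 309),
    ("B", "green", 408), ("B", "blue", 50),
    ("C", "red", 6000), ("C", "blue", 70), ("C", "green", 10),
]
picked = select_rows(rows, n=3, caps={("A", "blue"): 2}, seed=0)
```

With this rule, product A ends up with two randomly chosen blue rows plus the yellow row. Presumably the Spark version would use two passes of `row_number()` over windows: one partitioned by prod_name and colour and ordered by `rand()` to apply the cap, and one partitioned by prod_name and ordered by colour priority to keep the top 3.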
user3735871
