Sampling N rows for every key/value in a column using Pyspark

Asked Mar 21 '16 at 18:52

Active Mar 21 '16 at 18:58

I have data with X rows for every key (in this case, the key is a user). X varies by key: for example, I have 1000 rows (data points) for user 1 and 50 for user 2, and the data points are usually ordered by timestamp. What is the best way to get N random rows from the data for each key (each user)? I believe sampleByKey works if I have a fraction per key, but I need N random rows for each key.

Also, if a key has fewer than N rows, what will be returned?
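
To make the desired behavior concrete, here is a plain-Python sketch of the semantics being asked for. It is not Spark code, and the names `sample_n`, `sample_n_per_key`, and the toy data are made up for illustration. It assumes that a key with fewer than N rows should simply return all of its rows, which is only one possible answer to the second question.

```python
import random
from collections import defaultdict


def sample_n(rows, n, rng=random):
    """Return up to n rows chosen uniformly at random, without replacement."""
    rows = list(rows)
    # Assumption: if a key has fewer than n rows, keep all of them.
    return rng.sample(rows, min(n, len(rows)))


def sample_n_per_key(pairs, n, seed=None):
    """Group (key, row) pairs by key and sample up to n rows for each key."""
    rng = random.Random(seed)
    grouped = defaultdict(list)
    for key, row in pairs:
        grouped[key].append(row)
    return {key: sample_n(rows, n, rng) for key, rows in grouped.items()}


# Toy data mirroring the question: 1000 rows for user1, 50 rows for user2.
pairs = [("user1", t) for t in range(1000)] + [("user2", t) for t in range(50)]
result = sample_n_per_key(pairs, n=100, seed=42)
print({key: len(rows) for key, rows in result.items()})
# prints {'user1': 100, 'user2': 50}
```

On an RDD of (key, row) pairs, a helper like `sample_n` could presumably be applied per key with `rdd.groupByKey().flatMapValues(lambda rows: sample_n(rows, N))`, at the cost of holding all rows for one key in memory at once.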

edited Mar 21 '16 at 18:58

zero323

asked Mar 21 '16 at 18:52

Harish

0 Answers