How to select n rows from a large data set using Spark

Question

I need to select n rows from a very large data set with millions of rows, for example 4 million rows out of 15 million. Currently, I add a row_number to the records within each partition and select the required percentage of records from each partition. For instance, 4 million is about 26.67% of 15 million. But when I take only 26% from each partition, the total falls short because of the missing 0.67%. Rows are selected when their row_number is below the percentage cutoff. Is there a better way to do this?
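The shortfall described above comes from rounding each partition's share down independently. One possible fix, sketched here in plain Python, is to compute an integer quota per partition and hand the leftover rows to the partitions with the largest fractional remainders. The function name exact_quotas and the three equal partition sizes are hypothetical, chosen only to match the 4 million out of 15 million example.

```python
def exact_quotas(partition_sizes, n):
    """Split a target of n rows across partitions so the quotas sum to exactly n.

    Hypothetical helper: partition_sizes is a list of row counts per partition.
    """
    total = sum(partition_sizes)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total row count")

    # Integer arithmetic avoids floating-point rounding of the percentage.
    quotas = [n * size // total for size in partition_sizes]
    remainders = [n * size % total for size in partition_sizes]

    # Rows lost to rounding down; give one extra row to the partitions
    # whose ideal share had the largest fractional part.
    shortfall = n - sum(quotas)
    by_remainder = sorted(
        range(len(partition_sizes)), key=lambda i: remainders[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        quotas[i] += 1
    return quotas


# Assumed example: 15 million rows spread evenly over three partitions.
quotas = exact_quotas([5_000_000, 5_000_000, 5_000_000], 4_000_000)
print(sum(quotas))
```

With quotas like these, the existing row_number filter could compare against each partition's own quota instead of a shared percentage, so the selected rows add up to exactly n.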

randomSplit is a good option, but in some scenarios I need an exact number of sample records. – user2316771, Nov 17 '19 at 06:00

score 2 · Answer 1 · answered Nov 16 '19 at 10:30

The DataFrame sample function can be used. A solution is available in the linked question: How to select an exact number of random rows from DataFrame
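As a minimal sketch of how sample can be combined with limit to get an exact row count (this is not necessarily what the linked question recommends): oversample by a small margin, then cap the result at n. The helper names and the 10% margin are assumptions; df is expected to be a PySpark DataFrame, whose count, sample, and limit methods are used.

```python
def oversample_fraction(n, total, margin=0.1):
    """Fraction to pass to sample so that roughly n * (1 + margin) rows come back."""
    return min(1.0, n / total * (1.0 + margin))


def sample_exact(df, n, seed=42, margin=0.1):
    """Return about-random n rows from a PySpark DataFrame (hypothetical helper).

    sample() returns an approximate number of rows, so we draw slightly more
    than needed and let limit() trim the result down to n.
    """
    total = df.count()
    if n >= total:
        return df
    fraction = oversample_fraction(n, total, margin)
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    return sampled.limit(n)
```

Because sample is probabilistic, a small margin can occasionally yield fewer than n rows, so it is worth checking the resulting count and retrying with a larger margin if needed. Ordering by a random column and then applying limit(n) always gives exactly n rows, at the cost of a full sort.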

DataNoob
