I need to select n rows from a very large data set with millions of rows, say 4 million rows out of 15 million. Currently, I add a row_number to the records within each partition and select the required percentage of records from each partition. For instance, 4 million is 26.67% of 15 million. But when I select 26% from each partition, the total falls short because of the missing 0.67%. Rows are selected when the row_number is less than the percentage threshold. Is there a better way to do this?
- Have you tried `randomSplit`? – shay__ Nov 16 '19 at 07:18
- randomSplit is a good option. In some scenarios I need an exact number of sample records. – user2316771 Nov 17 '19 at 06:00
1 Answer
The DataFrame `sample` function can be used. A solution is available in the linked question "How to select an exact number of random rows from DataFrame".
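If the sample has to stay proportional per partition, as in the question, one option is to replace the fixed percentage with an integer quota per partition. Taking 26% of 15 million gives 3.9 million rows, which is 100,000 short of the target. The sketch below is only an illustration in plain Python, not Spark code: `allocate_quotas` and the partition sizes are hypothetical. It floors each partition's proportional share and then hands the leftover rows to the partitions with the largest remainders, so the quotas add up to exactly n.

```python
def allocate_quotas(partition_sizes, target):
    """Split `target` rows across partitions in proportion to their size.

    The returned quotas always sum to exactly `target`.
    """
    total = sum(partition_sizes.values())
    if total == 0 or not 0 <= target <= total:
        raise ValueError("target must be between 0 and the total row count")

    quotas, remainders = {}, {}
    for key, size in partition_sizes.items():
        # Integer arithmetic: floor of the proportional share, plus what was cut off.
        quotas[key], remainders[key] = divmod(size * target, total)

    # Rows still missing after flooring every share.
    leftover = target - sum(quotas.values())

    # Give one extra row to the partitions that lost the most to flooring.
    for key in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[key] += 1
    return quotas


# Hypothetical partition sizes that add up to 15 million rows.
sizes = {"a": 6_000_000, "b": 5_000_000, "c": 4_000_000}
quotas = allocate_quotas(sizes, 4_000_000)
print(quotas, sum(quotas.values()))
```

With quotas like these, the existing row_number approach could keep rows where row_number is less than or equal to the partition's quota instead of comparing against a percentage. If per-partition proportions do not matter, note that `sample` with a fraction does not guarantee an exact row count on its own, so it is usually combined with something like a random ordering and a `limit(n)`.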

DataNoob