
I need to select n rows from a very large data set with millions of rows, say 4 million rows out of 15 million. Currently, I add a row_number to the records within each partition and select the required percentage of records from each partition. For instance, 4 million is about 26.67% of 15 million. But when I take 26% from each partition, the total comes up short because of the missing 0.67%. Rows are selected when their row_number is less than the percentage threshold. Is there a better way to do this?
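To make the shortfall concrete: 26% of 15 million is 3.9 million, which is 100,000 rows short of the 4 million target. One possible way to keep the per-partition approach and still hit the exact total is to compute a row quota per partition and hand out the leftover rows by largest fractional remainder. The sketch below assumes three hypothetical partition sizes (they are not from the question), and `allocate_quotas` is an illustrative helper name, not a library function.

```python
from fractions import Fraction
from math import floor


def allocate_quotas(partition_sizes, target):
    """Split `target` rows across partitions in proportion to their sizes.

    Uses the largest-remainder method so the quotas sum to exactly `target`.
    """
    total = sum(partition_sizes)
    if not 0 <= target <= total:
        raise ValueError("target must be between 0 and the total row count")

    # Exact proportional share of each partition, kept as a fraction
    # so no floating-point rounding creeps in.
    exact = [Fraction(size * target, total) for size in partition_sizes]
    quotas = [floor(share) for share in exact]

    # Rows lost to rounding down; give one extra row to the partitions
    # with the largest fractional remainders.
    shortfall = target - sum(quotas)
    by_remainder = sorted(
        range(len(exact)),
        key=lambda i: exact[i] - quotas[i],
        reverse=True,
    )
    for i in by_remainder[:shortfall]:
        quotas[i] += 1
    return quotas


# Hypothetical partition sizes that add up to 15 million rows.
sizes = [5_000_000, 6_000_000, 4_000_000]

# Truncating to 26% per partition, as in the question.
truncated = sum(size * 26 // 100 for size in sizes)  # 3_900_000

# Per-partition quotas that add up to the exact target.
quotas = allocate_quotas(sizes, 4_000_000)  # sums to 4_000_000
```

With quotas like these, the filter becomes "row_number less than or equal to the partition's quota" instead of a shared percentage.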

user2316771

1 Answer


The DataFrame sample function can be used. A solution is available in the linked question: "How to select an exact number of random rows from DataFrame".
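The answer does not say which DataFrame library is meant, so the following is only a library-neutral sketch of what "an exact number of random rows" can look like. It uses reservoir sampling (Algorithm R) from the Python standard library, which is a substitute technique for illustration and not necessarily what the linked solution does. The name `reservoir_sample` and the toy input are assumptions.

```python
import random


def reservoir_sample(rows, n, seed=None):
    """Return exactly n rows chosen uniformly at random from an iterable.

    Makes a single pass and keeps only n rows in memory, so the input
    can be far larger than what fits in memory at once.
    If the input has fewer than n rows, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < n:
            # Fill the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability n / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < n:
                reservoir[j] = row
    return reservoir


# Toy stand-in for "4 million out of 15 million": 4 rows out of 15.
picked = reservoir_sample(range(15), 4, seed=42)
```

The result always has exactly `n` rows (when the input has at least that many), which avoids the rounding loss of a percentage cut-off.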

DataNoob