
I have a rather large dataset (~15 GB zipped). What is the most efficient way of random sampling from this dataset using Pandas? Currently I do the following:

df = pd.read_csv(file, names=[], sep='|', nrows=10000000)

However, this really does not serve my need, since the first ten million rows are not a random sample. Additionally, is there a way I can filter the data before creating the DataFrame?
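One approach I have seen sketched elsewhere (the 10% sampling fraction, the in-memory data, and the even-number predicate below are illustrative assumptions, not part of my actual setup) is to pass a callable to `skiprows` so rows are sampled while parsing, and to filter chunk-by-chunk with `chunksize` so the full file never has to fit in one DataFrame:

```python
import io
import random
import pandas as pd

# Hypothetical pipe-delimited data standing in for the real 15 GB file.
data = "a|b\n" + "\n".join(f"{i}|{i * 2}" for i in range(1000))

# Sample ~10% of rows at parse time: read_csv calls the skiprows callable
# with each row index; returning True skips that row. Index 0 is the
# header, so keep it unconditionally.
random.seed(0)
sample = pd.read_csv(
    io.StringIO(data), sep="|",
    skiprows=lambda i: i > 0 and random.random() > 0.10,
)

# Filter before building the full DataFrame: read in chunks and keep only
# rows matching a predicate, then concatenate the surviving pieces.
chunks = pd.read_csv(io.StringIO(data), sep="|", chunksize=200)
filtered = pd.concat(c[c["a"] % 2 == 0] for c in chunks)
```

Is something like this the idiomatic way, or is there a better option?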

Any help is appreciated :)

MSeifert
