
I have a 15 GB tab-delimited file with 500 million rows that I must read into Python and do some analysis on. What is the most efficient way to go about this?

I have access to a Linux server with 4 cores and 16 GB RAM. At the moment I am using dask.read_csv() with n_workers = 4 for my analysis, but I regularly run into memory issues and the Jupyter kernel dies during computations like groupby and iteration.

Are there any better ways to do this, or any way to avoid running out of memory in dask?
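For reference, my setup is roughly along these lines (the file path, column names, and per-worker memory limit are placeholders, not my exact code):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Cap each of the 4 workers so the cluster stays under the 16 GB of RAM
# (the memory_limit value here is a guess, not tuned).
client = Client(n_workers=4, threads_per_worker=1, memory_limit="3GB")

# Read the tab-delimited file in modest partitions instead of one big frame.
df = dd.read_csv(
    "data.tsv",        # placeholder path
    sep="\t",
    blocksize="64MB",  # smaller partitions -> lower peak memory per task
)

# The kind of groupby that kills the kernel; "key" and "value" are
# placeholder column names.
result = df.groupby("key")["value"].sum().compute()
```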

  • Does this answer your question? [Reading a huge .csv file](https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file) – ocirocir Nov 26 '20 at 16:11
  • Thanks, but I have tried chunksize in pandas and similar methods; dask seems better for me since I need its groupby feature on the dataframe, so I would appreciate it if someone could enlighten me on how to do this with dask. – Naive Bayes Nov 26 '20 at 16:33
  • 1
    There is not *general* better way, but there may be things you can do for your specific case. The more information you give about what exactly you are doing, the more likely the community can help. – mdurant Nov 26 '20 at 17:40
  • Not sure if it helps, but you could do some simple IO like read_csv then to_parquet; it may be easier to work with the data as a Parquet file (see the sketch after these comments). – Ray Bell Nov 29 '20 at 05:28
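
A minimal sketch of the read_csv-then-to_parquet approach Ray Bell mentions (file paths and column names are placeholders, and it assumes pyarrow is installed):

```python
import dask.dataframe as dd

# One-off conversion: stream the TSV in partitions and write it out as Parquet.
df = dd.read_csv("data.tsv", sep="\t", blocksize="64MB")
df.to_parquet("data_parquet/", engine="pyarrow", write_index=False)

# Subsequent analysis can read only the columns it needs, which is much
# lighter on memory than re-parsing the 15 GB text file every time.
cols = ["key", "value"]  # placeholder column names
df2 = dd.read_parquet("data_parquet/", columns=cols, engine="pyarrow")
summary = df2.groupby("key")["value"].mean().compute()
```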

0 Answers