
pd.read_csv(chunksize=...) is a common approach when processing a large CSV file, but I don't understand why reading by chunk helps reduce memory usage.

Comparing

import pandas as pd

# read in chunks of 1,000,000 rows, then stitch them back together
df_list = pd.read_csv('huge_data.csv', chunksize=1000000)
df = pd.concat(df_list)

vs.

df = pd.read_csv('huge_data.csv')

Since in the end the whole huge_data.csv is read into memory anyway, why isn't the same amount of memory used?
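For context, here is a minimal sketch of the pattern I understand chunking is actually meant to enable, where each chunk is reduced to a small result and discarded instead of being concatenated (the column name 'value' is made up):

import pandas as pd

totals = []
for chunk in pd.read_csv('huge_data.csv', chunksize=1000000):
    # keep only a small per-chunk aggregate; the chunk itself can then be freed
    totals.append(chunk['value'].sum())
total = sum(totals)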

Reference: https://medium.com/analytics-vidhya/optimized-ways-to-read-large-csvs-in-python-ab2b36a7914e


Additional question: my service has a constant health check. Previously, without chunksize, read_csv blocked the event loop long enough that my service was marked as dead. Now that it processes by chunk, the service has been healthy. Why? Does processing by chunk release the GIL?
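For reference, a simplified sketch of roughly what my chunked version does now (the coroutine name and the explicit await asyncio.sleep(0) between chunks are simplifications/assumptions about my own setup):

import asyncio
import pandas as pd

async def load_huge_csv(path):
    frames = []
    for chunk in pd.read_csv(path, chunksize=1000000):
        frames.append(chunk)
        # give control back to the event loop between chunks so the
        # health-check endpoint can still be served
        await asyncio.sleep(0)
    return pd.concat(frames)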

Elaine Chen
  • you have to use it in a for-loop to read partially. It reads only part of the file, which you can process and discard before you read the next chunk. – furas May 30 '22 at 01:49
  • Both benchmarks (with `chunksize` and `dask`) in the referenced article are not valid. The first excludes `pd.concat` (where the data is actually loaded), the second doesn't load any data at all (dask is lazy and does nothing until `compute` is called). There is no difference in memory usage between the examples in the question. – Michael Szczesny May 30 '22 at 01:55
  • While processing the CSV file, there is a small difference (about 300 MB) in **peak** memory usage. Loading with `chunksize`/`pd.concat` is ~1.22x faster (22.4s vs 18.3s). – Michael Szczesny May 30 '22 at 02:17
  • @MichaelSzczesny Can you publish your code? I'm assuming you are using some sort of memory-profiler thx – Elaine Chen May 30 '22 at 02:56
  • Essentially, I used [this approach](https://stackoverflow.com/a/7669482/14277722) to monitor the peak memory usage. On a [colab instance](https://colab.research.google.com/drive/1HWdWqKszWMjt8kINM3pM93FyzynKo4xC?usp=sharing) I get different results: 4.2 GB vs 3.2 GB peak memory, but slightly slower loading times with `chunksize`. I used the 1.3 GB csv file mentioned in the article for my benchmarks. – Michael Szczesny May 30 '22 at 07:03
  • @MichaelSzczesny I'm happy to give you the accepted answer if you'd like to submit one – Elaine Chen May 31 '22 at 04:07
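For anyone trying to reproduce the peak-memory numbers discussed in the comments above, here is a minimal sketch of one way to record peak memory while loading, using tracemalloc; this is not necessarily the approach linked in the comments:

import tracemalloc
import pandas as pd

tracemalloc.start()
df = pd.concat(pd.read_csv('huge_data.csv', chunksize=1000000))
current, peak = tracemalloc.get_traced_memory()
print(f'peak traced memory: {peak / 1e9:.2f} GB')
tracemalloc.stop()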

0 Answers