pd.read_csv(..., chunksize=chunksize)
is a common approach when processing a large CSV file, but I don't understand why reading in chunks helps reduce memory usage.
Comparing
df_list = pd.read_csv('huge_data.csv', chunksize=1000000)
df = pd.concat(df_list)
vs.
df = pd.read_csv('huge_data.csv')
Since the whole huge_data.csv ends up being read into memory either way, why isn't the same amount of memory used?
Reference: https://medium.com/analytics-vidhya/optimized-ways-to-read-large-csvs-in-python-ab2b36a7914e
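To make the comparison concrete, here is a sketch of the pattern where, as I understand it, chunking is actually supposed to save memory: each chunk is reduced to a small running aggregate and then discarded instead of being concatenated. The column name 'value' is just an assumption for illustration, not a real column of huge_data.csv.

import pandas as pd

# Only one chunk (~1,000,000 rows) is resident at a time, because each
# chunk is reduced to a running total and then freed.
# 'value' is an assumed column name used only for this sketch.
total = 0.0
row_count = 0
for chunk in pd.read_csv('huge_data.csv', chunksize=1_000_000):
    total += chunk['value'].sum()
    row_count += len(chunk)

print(total / row_count)  # mean of 'value' without ever building the full DataFrame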
Additional question: my service has a constant health check. Previously, without chunksize, read_csv blocked the event loop long enough that the service was marked as dead. Now that I process the file in chunks, the service has stayed healthy. Why? Does processing by chunk release the GIL?
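For context on the additional question, this is a sketch of the kind of chunked pattern I have in mind; it assumes an asyncio-based service (mine is similar, but the function name and structure here are illustrative only) and offloads each chunk parse to a worker thread so the event loop can keep answering health checks between chunks.

import asyncio
import pandas as pd

async def load_csv_in_chunks(path: str) -> pd.DataFrame:
    # Sketch, assuming an asyncio-based service: parse one chunk at a time
    # in a worker thread via run_in_executor, so the event loop is only
    # blocked between awaits rather than for the entire parse.
    # Note: concatenating at the end still holds the full DataFrame in
    # memory; this only addresses event-loop responsiveness.
    loop = asyncio.get_running_loop()
    reader = pd.read_csv(path, chunksize=1_000_000)
    parts = []
    while True:
        chunk = await loop.run_in_executor(None, lambda: next(reader, None))
        if chunk is None:
            break
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)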