
I just wrote a CSV file using pandas' to_csv function, and the file is 13GB on disk. I want to read it back into a pandas DataFrame using pd.read_csv. While reading the file in, I monitor the memory usage of the server: it climbs past 30GB and the file is never fully read. The kernel of my Jupyter notebook dies and I have to start the process over again.

My question is: why does this happen? Writing and reading the file is a very simple piece of code, so why are the space requirements so different? And finally, how do I read this file in?
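
For reference, this is roughly what the round trip looks like (a minimal sketch; the file name and the synthetic data are placeholders, and the real DataFrame is far larger):

import pandas as pd
import numpy as np

# stand-in for the real data; the actual frame has many more rows
df = pd.DataFrame(np.random.rand(1000, 10))

df.to_csv('big_file.csv', index=False)   # the real file ends up ~13GB on disk
df2 = pd.read_csv('big_file.csv')        # this step consumes 30GB+ and kills the kernel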

Patthebug
    Possible duplicate of [How to read a 6 GB csv file with pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) – DarkCygnus Jun 26 '17 at 21:47

1 Answer


Use chunks to minimize memory usage while loading.

import pandas as pd

chunksize = 10 ** 8  # rows per chunk; lower this value if memory is still tight
chunks = pd.read_csv(filename, chunksize=chunksize)  # returns an iterator of DataFrames
df = pd.concat(chunks, ignore_index=True)

If that doesn't work, the version below calls the garbage collector inside the for loop, which may give a minor improvement:

import pandas as pd
import gc

chunksize = 10 ** 8  # rows per chunk
dfs = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    dfs.append(chunk)
    gc.collect()  # free intermediate parser buffers after each chunk
final_df = pd.concat(dfs)
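
Note that pd.concat still needs enough memory to hold the full DataFrame, so if the complete data set simply does not fit in RAM, you will have to work on each chunk as it is read and keep only a small result. A rough sketch, where a simple column sum stands in for the real per-chunk work and filename is the same placeholder as above:

import pandas as pd

chunksize = 10 ** 6  # rows per chunk
totals = None
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # reduce each chunk to a small result instead of keeping the chunk around
    partial = chunk.sum(numeric_only=True)
    totals = partial if totals is None else totals.add(partial, fill_value=0)

This keeps only one chunk in memory at a time.
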
Matt
    as [this](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) question also indicates – DarkCygnus Jun 26 '17 at 21:47
    Thanks a lot. I get the error `NameError: name 'process' is not defined`. I believe it needs an import. – Patthebug Jun 26 '17 at 22:22
    So I used the exact same code and played around a little bit with the value of chunksize, I'm still running into `MemoryError`. Here's my code: `chunks=pd.read_csv('filename.csv',chunksize=10000) f=pd.DataFrame() %time df=pd.concat(chunks, ignore_index=True)` – Patthebug Jun 27 '17 at 17:22