I've been trying to process a 1.4 GB CSV file with pandas, but I keep running into memory problems. I have tried several approaches to make pandas read_csv work, to no avail.
- It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the slower it is to process the same amount of data. (Simple extra overhead doesn't explain it, because it was far slower when the number of chunks was large. I suspect that when processing each chunk, pandas needs to go through all the chunks before it to "get to it", instead of jumping right to the start of the chunk; this seems the only way it can be explained.) A sketch of this attempt is shown after this list.
- Then, as a last resort, I split the CSV file into 6 parts and tried to read them one by one, but I still get a MemoryError. (I monitored Python's memory usage while running the code below, and found that each time Python finishes processing a file and moves on to the next, the memory usage goes up. It seems clear that pandas doesn't release the memory for the previous file once it has finished processing it.)
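For reference, here is roughly what the chunked attempt looked like. The filename 'Crimes.csv' and the chunksize of 100000 are placeholders (I experimented with different sizes), and the per-chunk processing mirrors the column selection in the loop further down:

import pandas as pd

# Read the big file in pieces instead of all at once.
# chunksize=100000 is a placeholder; I tried several values.
reader = pd.read_csv('Crimes.csv', iterator=True, chunksize=100000)
for chunk in reader:
    df = chunk.iloc[:, [5, 8, 15, 16]]   # keep only the four columns I need
    df = df.dropna(how='any')            # drop rows with missing values
    # ... the SQL-writing step (removed here) ran on each chunk ...

Even with a small chunksize, the throughput dropped as described above.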
The code below may not make sense on its own; that's because I removed the part that writes into an SQL database, to simplify it and isolate the problem.
import glob
import pandas as pd

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)            # read one part into memory
    df = chunk.iloc[:, [5, 8, 15, 16]]       # keep only the four columns I need
    df = df.dropna(how='any')                # drop rows with missing values
    counter += 1
    print(counter)
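For what it's worth, here is a variant of the loop above with explicit cleanup between files. The del and gc.collect() calls are my guess at forcing pandas to release the previous file's memory; I don't know whether this is the right approach:

import gc
import glob
import pandas as pd

filenameStem = 'Crimes'
for counter, filename in enumerate(glob.glob(filenameStem + '_part*.csv'), start=1):
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]]
    df = df.dropna(how='any')
    print(counter)
    del chunk, df    # drop references to this file's data before the next iteration
    gc.collect()     # my guess: explicitly ask Python to reclaim the memory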