So I have been trying to read a 3.2 GB file into memory using pandas' read_csv function, but I kept running into what looked like some sort of memory leak: my memory usage would spike to 90%+.
As alternatives, I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour. I tried out numpy's CSV reader (genfromtxt), thinking I would get different results, but was definitely wrong about that. I also tried reading the file line by line (roughly the sketch below) and ran into the same problem, only much more slowly.
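For reference, this is roughly what I mean by reading line by line: a minimal sketch using the stdlib csv module (not the exact code I ran), accumulating rows as plain Python lists. The per-row Python string objects are what make this approach both slow and memory-hungry.
import csv

rows = []
with open('data/train.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)      # skip the header line
    for row in reader:
        rows.append(row)       # each row is a list of Python strings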
I recently moved to Python 3, so I thought there could be some bug there, but I saw similar results with Python 2 + pandas.
The file in question is the train.csv file from the Kaggle Grupo Bimbo competition.
System info:
RAM: 16 GB, Processor: i7, 8 cores
Let me know if you would like to know anything else.
Thanks :)
EDIT 1: It's a memory spike, not a leak (sorry, my bad).
EDIT 2: Sample of the csv file
Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
EDIT 3: Number of rows in the file: 74180465.
Other than a simple pd.read_csv('filename', low_memory=False), I have tried:
from numpy import genfromtxt
my_data = genfromtxt('data/train.csv', delimiter=',')
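If I understand genfromtxt's defaults correctly, it parses every column as float64 unless a dtype is given, so for roughly 74 million rows and 11 columns the array alone is about 74e6 * 11 * 8 bytes, around 6.5 GB, before any parsing overhead (that size estimate is mine, not measured). A quick check on the array from above:
print(my_data.dtype)               # float64 by default
print(my_data.nbytes / 1e9, 'GB')  # size of the array's data buffer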
UPDATE: The code below just worked, but I still want to get to the bottom of this problem; there must be something wrong.
import pandas as pd
import gc

data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)
for sub_data in data_iterator:
    data = data.append(sub_data)  # append returns a new DataFrame, so reassign it
    gc.collect()
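A variant that collects the chunks in a list and concatenates once at the end (same file path and chunk size as above; ignore_index is my choice) should avoid copying the growing DataFrame on every append. I haven't timed it, but I believe it is the usual idiom:
import pandas as pd

chunks = []
for sub_data in pd.read_csv('data/train.csv', chunksize=100000):
    chunks.append(sub_data)                  # keep each chunk as its own DataFrame
data = pd.concat(chunks, ignore_index=True)  # single copy into the final frame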
EDIT: Piece of code that worked. Thanks for all the help, everyone; I had messed up my dtypes by specifying Python types instead of NumPy ones. Once I fixed that, the code below worked like a charm.
import numpy as np
import pandas as pd

dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int8,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int8,
          'Cliente_ID': np.int8,
          'Producto_ID': np.int8,
          'Venta_uni_hoy': np.int8,
          'Venta_hoy': np.float16,
          'Dev_uni_proxima': np.int8,
          'Dev_proxima': np.float16,
          'Demanda_uni_equil': np.int8}
data = pd.read_csv('data/train.csv', dtype=dtypes)
This brought the memory consumption down to just under 4 GB.
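For anyone who wants to double-check the savings, a quick per-column breakdown on the DataFrame loaded above shows where the memory goes; deep=True also counts the Python objects that any object-dtype columns would otherwise hide:
print(data.memory_usage(deep=True) / 1e6)  # per-column usage in MB
data.info(memory_usage='deep')             # summary including the total in-memory size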