
I have a data set of 2.5 GB which contains tens of millions of rows.

I'm trying to load the data like this:

 %%time
 import pandas as pd
 data=pd.read_csv('C:\\Users\\mahes_000\\Desktop\\yellow.csv',iterator=True,
                  chunksize=50000)

This gives me the data in multiple chunks of size chunksize, and I'm trying to do some operations like:

 %%time
 data.get_chunk().head(5)                      # reads the first chunk
 data.get_chunk().shape                        # reads the *next* chunk, not the same one
 data.get_chunk().drop(['Rate_Code'],axis=1)   # reads yet another chunk

Each operation picks a single chunk and applies the operation only to that chunk. Then what about the remaining parts? How can I do operations on the complete data without a memory error?


1 Answer


From the documentation on the parameter chunksize:

Return TextFileReader object for iteration

Thus, by placing the returned object in a loop, you will iteratively read the data in chunks of the size specified by chunksize:

import pandas as pd

chunksize = 50000  # rows per chunk; an int is clearer than the float 5e4
for chunk in pd.read_csv(filename, chunksize=chunksize):
    print(chunk.head(5))
    print(chunk.shape)  # note: shape is an attribute, not a method
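
If the goal is to apply an operation such as the drop from the question to every chunk, a minimal sketch could look like the following (here filename is a hypothetical stand-in for the asker's CSV path, and it assumes the processed result still fits in memory):

import pandas as pd

filename = 'yellow.csv'  # hypothetical path standing in for the asker's file
chunksize = 50000

processed = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # apply the per-chunk operation, e.g. dropping the Rate_Code column
    processed.append(chunk.drop(['Rate_Code'], axis=1))

# combine the processed chunks; assumes the combined result fits in memory
result = pd.concat(processed, ignore_index=True)

If even the combined result is too large for memory, each processed chunk can instead be appended to an output file with to_csv(..., mode='a') (writing the header only on the first chunk), so that only one chunk is held in memory at a time.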