
I have a data set of 2.5 GB which contains tens of millions of rows.

I'm trying to load the data like this:

 %%time
 import pandas as pd
 data=pd.read_csv('C:\\Users\\mahes_000\\Desktop\\yellow.csv',iterator=True,
                  chunksize=50000)

This gives me the data in multiple chunks of size chunksize, and I'm trying to do some operations like:

 %%time
 data.get_chunk().head(5)                      # reads the first chunk
 data.get_chunk().shape                        # reads the *next* chunk, not the same one
 data.get_chunk().drop(['Rate_Code'],axis=1)   # reads yet another chunk

Each operation picks a single chunk and applies the operation only to that chunk. Then what about the remaining parts? How can I do operations on the complete data without a memory error?


1 Answer


From the documentation on the parameter chunksize:

Return TextFileReader object for iteration

Thus, by placing the returned object in a loop, you will iteratively read the data in chunks of the size specified by chunksize:

import pandas as pd

chunksize = 50000  # rows per chunk; an int is clearer than the float 5e4
for chunk in pd.read_csv(filename, chunksize=chunksize):
    print(chunk.head(5))
    print(chunk.shape)  # note: shape is an attribute, not a method
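
If the goal is to apply an operation such as the drop from the question to every chunk, a minimal sketch could look like the following (here filename is a hypothetical stand-in for the asker's CSV path, and it assumes the processed result still fits in memory):

import pandas as pd

filename = 'yellow.csv'  # hypothetical path standing in for the asker's file
chunksize = 50000

processed = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # apply the per-chunk operation, e.g. dropping the Rate_Code column
    processed.append(chunk.drop(['Rate_Code'], axis=1))

# combine the processed chunks; assumes the combined result fits in memory
result = pd.concat(processed, ignore_index=True)

If even the combined result is too large for memory, each processed chunk can instead be appended to an output file with to_csv(..., mode='a') (writing the header only on the first chunk), so that only one chunk is held in memory at a time.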