
I am trying to process a huge CSV file with pandas. Firstly, I ran into a memory error when loading the file. I was able to fix it with this:

    import pandas as pd

    df = pd.read_csv('data.csv', chunksize=1000, low_memory=False)
    device_data = pd.concat(df, ignore_index=True)
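
(For reference, read_csv also accepts usecols and dtype to limit what gets parsed, which cuts memory at load time. The sketch below uses placeholder column names and dtypes, not the real ones from data.csv.)

    import pandas as pd

    # Hypothetical subset of the 54 columns and guessed dtypes -- adjust to the real file.
    cols = ['ID', 'device', 'value']
    dtypes = {'ID': str, 'device': str, 'value': 'float32'}

    reader = pd.read_csv('data.csv', usecols=cols, dtype=dtypes, chunksize=100000)
    device_data = pd.concat(reader, ignore_index=True)
    print(device_data.memory_usage(deep=True).sum())  # bytes actually held in memory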

However, I still get memory errors when processing the "device_data" with multiple filters

Here are my questions: 1. Is there any way to get rid of memory errors when processing the DataFrame loaded from that huge CSV?

2. I have also tried adding conditions while concatenating the DataFrames from the iterator, referring to this question: "How can I filter lines on load in Pandas read_csv function?"

    iter_csv = pd.read_csv('data.csv', iterator=True, chunksize=1000)
    df = pd.concat([chunk[chunk['ID'] == 1234567] for chunk in iter_csv])

However, the number of results seems to be much lower than it should be. Does anyone have any advice?

Thanks.

Update on 2019/02/19

I have managed to load the CSV via the code below. However, I noticed that the number of results (shown in df.shape) varies with different chunksize values.

    iter_csv = pd.read_csv('data.csv', iterator=True, chunksize=1000)
    df = pd.concat([chunk[chunk['ID'] == 1234567] for chunk in iter_csv])
    df.shape
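
One way to narrow down why the count changes with chunksize is to force the ID column to a single dtype on load and count the matches chunk by chunk. The sketch below assumes the differences come from per-chunk dtype inference, which is only a guess:

    import pandas as pd

    # Read ID as a string so every chunk compares against the same type
    # (assumption: varying counts come from per-chunk dtype inference).
    iter_csv = pd.read_csv('data.csv', iterator=True, chunksize=1000, dtype={'ID': str})

    matches = []
    total_rows = 0
    for chunk in iter_csv:
        total_rows += len(chunk)
        matches.append(chunk[chunk['ID'] == '1234567'])  # compare as string, not int

    df = pd.concat(matches, ignore_index=True)
    print(total_rows, df.shape)  # rows scanned vs. rows kept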
  • It would be helpful if you could give us an idea of how big the "huge" csv file is. Is it in the order of 1M rows by 1M cols, for instance? – kentwait Feb 18 '19 at 03:27
  • The size of the csv file is around 200 MB, and there are 1M+ rows by 54 cols. – Scorpioooooon21 Feb 18 '19 at 04:40
  • How much RAM do you have? I've worked with far larger files without even using chunks... What error do you get when calling simply pd.read_csv('data.csv')? If I understand correctly, you managed to get the whole frame into memory using the chunksize approach. If size is really the issue, you could reduce the size by typecasting the numerical columns from 64-bit to 32-bit or lower: e.g. df[int_columns] = df[int_columns].astype('int32') – nkaenzig Feb 18 '19 at 19:03
  • @nkaenzig I have only 8 GB of RAM on my desktop. The dtype of the columns is "object"; it seems pandas does not allow converting the dtype from object to int. – Scorpioooooon21 Feb 19 '19 at 03:39
  • Casting from object to int or float dtype should work if the column contains only numbers. The column 'ID' you used in the example seems a candidate for casting, as the IDs are probably all integer numbers? (However, 8 GB should actually be enough; I loaded 1 GB .csvs with only 6 GB of RAM.) Maybe update your python/pandas version? – nkaenzig Feb 19 '19 at 13:27
  • @nkaenzig The column ID in my csv contains "/" or some other characters, so it is definitely not int or float. Actually, I am now able to load the csv into a DataFrame, just not able to do any further filtering work. – Scorpioooooon21 Feb 20 '19 at 00:06
  • Actually, I think I have fixed these issues. I realized that I was running 32-bit Python on a 64-bit OS. After upgrading Python to 64-bit, it seems all the memory issues are gone. Thanks to @nkaenzig – Scorpioooooon21 Feb 20 '19 at 05:19
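
Following up on nkaenzig's downcasting suggestion in the comments, here is a minimal sketch of shrinking the frame after loading; which columns are safely numeric is an assumption that would have to be checked against the real data:

    import pandas as pd

    # device_data: the DataFrame loaded above.
    # Hypothetical list of columns that should be numeric -- verify against the real file.
    num_cols = ['value', 'reading']

    for col in num_cols:
        # errors='coerce' turns non-numeric entries into NaN instead of raising;
        # downcast='float' picks the smallest float dtype that fits the values.
        device_data[col] = pd.to_numeric(device_data[col], errors='coerce', downcast='float')

    print(device_data.dtypes)
    print(device_data.memory_usage(deep=True).sum())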
