I have a 1 GB CSV file with about 10,000,000 (10 million) rows. I need to iterate through the rows to get the maximum of a few selected rows (based on a condition). The issue is reading the CSV file.
I use the Pandas package for Python. The read_csv() function throws a MemoryError while reading the CSV file. I have tried splitting the file into chunks and reading them, but now the concat() function runs out of memory:
tp = pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                 dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                        'vdd_io_plat_i': float, 'vdd_io_plat_v': float,
                        'vdd_ext_flash_i': float, 'vdd_ext_flash_v': float,
                        'vsys_i': float, 'vsys_v': float,
                        'vdd_aon_dig_i': float, 'vdd_aon_dig_v': float,
                        'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})
df = pd.concat(tp, ignore_index=True)
I have used the dtype argument to reduce the memory footprint, but there is no improvement.
Based on multiple blog posts, I have updated NumPy and Pandas to the latest versions, still with no luck.
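For context, the computation I ultimately need is along these lines. This is only a sketch: the threshold and the two columns used in the condition are placeholders for my actual filter, and it keeps a running maximum per chunk instead of building the full DataFrame with concat():

import pandas as pd

running_max = None
reader = pd.read_csv('capture2.csv', iterator=True, chunksize=10000)
for chunk in reader:
    # placeholder condition: keep rows where the voltage column exceeds a threshold
    selected = chunk[chunk['vdd_io_soc_v'] > 1.7]
    if not selected.empty:
        # maximum of the current column over the selected rows of this chunk
        chunk_max = selected['vdd_io_soc_i'].max()
        running_max = chunk_max if running_max is None else max(running_max, chunk_max)

print(running_max)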
It would be great if anyone has a solution to this issue.
Please note:
I have a 64-bit operating system (Windows 7)
I am running Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit]
I have 4 GB of RAM.
NumPy: latest (pip says the latest version is installed)
Pandas: latest (pip says the latest version is installed)