
I am new to data science and data analytics, so I hope my question is not too naive. I am currently trying to open a file with pandas and Python for machine learning purposes, and it would be ideal for me to have all of it in a DataFrame. The file is 18 GB and my RAM is 32 GB, but I keep getting memory errors.

  1. From your experience, is this possible?
  2. If not, do you know of a better way to go about it? (A Hive table? Increasing my RAM to 64 GB? Creating a database and accessing it from Python, as in the rough sketch below?) Every input will be welcome!
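For the database route in option 2, what I have in mind is roughly the following (a minimal sketch only; the file, table, and column names are placeholders):

import sqlite3
import pandas as pd

# Stream the large CSV into SQLite chunk by chunk, so only one chunk
# is ever held in RAM. File, table, and column names are placeholders.
conn = sqlite3.connect('big_data.db')
for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
    chunk.to_sql('records', conn, if_exists='append', index=False)

# Later, pull back only the subset that is actually needed.
subset = pd.read_sql_query('SELECT * FROM records WHERE some_column > 0', conn)
conn.close()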

Thanks in advance.

Boris K
  • Best practice aside, are you using 64-bit Python? 32-bit Python has a 2 GB memory limit. – Stev Feb 12 '18 at 14:15
  • Possible duplicate of [How to read a 6 GB csv file with pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) – error Feb 12 '18 at 14:15
  • Related: [“Large data” work flows using pandas](https://stackoverflow.com/q/14262433/190597) – unutbu Feb 12 '18 at 14:15
  • The duplicate is for a smaller file size but the underlying issue is the same. – error Feb 12 '18 at 14:15
  • Hey Stev, thanks for your answer. I am working within Anaconda (64-bit): Jupyter QtConsole 4.3.1, Python 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], IPython 6.1.0. – Boris K Feb 12 '18 at 14:22

2 Answers


You should try to read and process one predefined chunk of data at a time by using chunksize, as explained here:

import pandas as pd

for chunk in pd.read_csv(f, sep=' ', header=None, chunksize=512):  # f is the path to your file
    # process your chunk here
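If only a filtered subset of the rows is needed in the end, one possible pattern (sketch only; the column index and filter condition are placeholders) is to reduce each chunk before combining, so that only the reduced result has to fit in memory:

import pandas as pd

# Keep only the rows of interest from each chunk, then combine the
# reduced pieces. Column index and filter are placeholders.
reduced_parts = []
for chunk in pd.read_csv(f, sep=' ', header=None, chunksize=100_000):
    reduced_parts.append(chunk[chunk[0] > 0])
small_df = pd.concat(reduced_parts, ignore_index=True)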
Gal Ben Arieh
  • Thanks Gal Ben. Actually, I cannot do my analysis in chunks, but you gave me a good idea here and I'm trying it out: I am preprocessing the data in chunks first so I can reduce the number of records and create a shorter Excel file from which I can derive a better DataFrame. – Boris K Feb 15 '18 at 09:36
  • Np! Good luck :) Btw, the reason you get memory exceptions is that the loaded data is bigger in memory than it is on disk. Pandas does some processing on the data in order to achieve fast querying and aggregation, which makes it use much more memory than the raw file. – Gal Ben Arieh Feb 22 '18 at 16:38
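To illustrate the point in the comment above: much of that overhead comes from pandas' default 64-bit and object dtypes, so specifying narrower dtypes (and 'category' for repetitive strings) can shrink the in-memory footprint considerably. A rough sketch, with hypothetical column names, which can also be combined with chunksize as above:

import pandas as pd

# Hypothetical column names; the idea is that explicit, narrower dtypes
# reduce how much memory each column occupies once loaded.
df = pd.read_csv('big_file.csv',
                 dtype={'id': 'int32', 'price': 'float32', 'city': 'category'})
print(df.memory_usage(deep=True).sum() / 1e9, 'GB in memory')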

Can you work with the data in chunks? If so, you can use the iterator interface of pandas to go through the file.

import pandas as pd

df_iterator = pd.read_csv('test.csv', index_col=0, iterator=True, chunksize=5)
for df in df_iterator:
    print(df)
    # do something meaningful with each chunk
    print('finished iteration on {} rows'.format(df.shape[0]))
    print()
Tim