4

I am working with a very wide dataset (1005 rows × 590,718 columns, about 1.2 GB). Loading such a large dataset into a pandas DataFrame fails outright due to insufficient memory.

I am aware that Spark is probably a good alternative to pandas for dealing with large datasets, but is there any practical way in pandas to reduce memory usage while loading large data?

RJF
  • 427
  • 5
  • 16

1 Answer

2

You could use

pandas.read_csv(filename, chunksize=chunksize)
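
A minimal sketch of how this could look in practice (the file name and chunk size below are placeholders, not values from the question):

    import pandas as pd

    filename = "wide_data.csv"  # placeholder path
    chunksize = 100             # rows per chunk; tune to fit memory

    # With chunksize, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # process each chunk here (filter, aggregate, downcast dtypes, ...)
        print(chunk.shape)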
grshankar
  • 417
  • 2
  • 14
  • Do I need to append chunks later on? My dataset is too wide. Is there similar functionality for columns or should I transpose my df? – RJF Feb 26 '18 at 16:01
  • 1
    You can follow it up with the concat function, like so: `chunk_df = pd.read_csv(filename, iterator=True, chunksize=chunksize)` `df = pd.concat(chunk_df, ignore_index=True)` – grshankar Feb 26 '18 at 16:20
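
A runnable version of the pattern from the comment above (again, the file name and chunk size are placeholder values):

    import pandas as pd

    filename = "wide_data.csv"  # placeholder path
    chunksize = 100             # rows per chunk

    # Read lazily in chunks, then stitch the pieces back together.
    chunk_df = pd.read_csv(filename, iterator=True, chunksize=chunksize)
    df = pd.concat(chunk_df, ignore_index=True)

Note that concatenating every chunk rebuilds the full DataFrame in memory, so this only helps if each chunk is reduced (filtered, aggregated, or downcast) before the concat.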