4

I am working with a very wide dataset (1005 rows × 590,718 columns, about 1.2 GB). Loading such a large dataset into a pandas DataFrame fails outright due to insufficient memory.

I am aware that Spark is probably a good alternative to pandas for dealing with large datasets, but is there any practical way in pandas to reduce memory usage while loading large data?

RJF
  • 427
  • 5
  • 16

1 Answer

2

You could use

pandas.read_csv(filename, chunksize=chunksize)
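
A minimal sketch of how this could look in practice (the file name and chunk size below are placeholders, not values from the question):

    import pandas as pd

    filename = "wide_data.csv"  # placeholder path
    chunksize = 100             # rows per chunk; tune to fit memory

    # With chunksize, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        # process each chunk here (filter, aggregate, downcast dtypes, ...)
        print(chunk.shape)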
grshankar
  • 417
  • 2
  • 14
  • Do I need to append chunks later on? My dataset is too wide. Is there similar functionality for columns or should I transpose my df? – RJF Feb 26 '18 at 16:01
  • 1
    You can follow it up with the concat function, like so: `chunk_df = pd.read_csv(filename, iterator=True, chunksize=chunksize)` `df = pd.concat(chunk_df, ignore_index=True)` – grshankar Feb 26 '18 at 16:20
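
A runnable version of the pattern from the comment above (again, the file name and chunk size are placeholder values):

    import pandas as pd

    filename = "wide_data.csv"  # placeholder path
    chunksize = 100             # rows per chunk

    # Read lazily in chunks, then stitch the pieces back together.
    chunk_df = pd.read_csv(filename, iterator=True, chunksize=chunksize)
    df = pd.concat(chunk_df, ignore_index=True)

Note that concatenating every chunk rebuilds the full DataFrame in memory, so this only helps if each chunk is reduced (filtered, aggregated, or downcast) before the concat.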