
My tasks:

  1. load matrices whose dimensions are bigger than my RAM from the database using pandas.read_sql(...) (the database is PostgreSQL)
  2. operate on the numpy representation of those larger-than-RAM matrices using numpy

The problem: I get a memory error even just loading the data from the database.

My temporary quick-and-dirty solution: loop over chunks of the data (importing only part of it at a time) so that RAM can handle the workload. The issue is speed: runtime is significantly higher, and before delving into Cython optimization and the like, I wanted to know whether there are solutions (either in the form of data structures, such as the shelve library, or the HDF5 format) that would address this.
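For concreteness, here is a minimal sketch of that chunked approach combined with an on-disk HDF5 store (pandas' HDFStore, which needs PyTables); the connection string, the table name matrix_table and the chunk size are made-up placeholders:

```python
# Minimal sketch: stream the query result in chunks and spill it to HDF5.
# Connection string, table name ("matrix_table") and chunk size are
# made-up placeholders; pandas' HDFStore requires the PyTables package.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")
chunk_size = 50000  # rows per chunk; tune so one chunk comfortably fits in RAM

with pd.HDFStore("matrix.h5", mode="w") as store:
    # chunksize makes read_sql return an iterator of DataFrames instead of
    # materialising the whole result set at once
    for chunk in pd.read_sql("SELECT * FROM matrix_table", engine, chunksize=chunk_size):
        store.append("matrix", chunk)  # append each chunk to the on-disk table

# later, slices can be read back without loading everything into memory
with pd.HDFStore("matrix.h5", mode="r") as store:
    block = store.select("matrix", start=0, stop=chunk_size)
    arr = block.to_numpy()  # hand one block at a time to numpy
```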

Asher11
  • Would you like to explore [dask](http://dask.pydata.org/en/latest/)? A Dask DataFrame is a large parallel dataframe composed of many smaller pandas dataframes, split along the index. These pandas dataframes may live on disk, which allows larger-than-memory work. This could work for you. – Zero Dec 23 '16 at 17:07
  • I am open to exploring anything that works. By the way, super quick answer :) – Asher11 Dec 23 '16 at 17:07
  • 1
    indeed. I am going through the docs and tutorials now. I see the library is though a much smaller version than pandas (understandably since it's huge). but I also see that with some tricks (https://github.com/dask/dask/issues/943) and (http://stackoverflow.com/questions/31361721/python-dask-dataframe-support-for-trivially-parallelizable-row-apply) I might be able to get the job done. I'll still need to understand how to best wrap all of this but this definetely looks promising. thank you – Asher11 Dec 23 '16 at 17:18
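For reference, a rough sketch of the dask route suggested in the comments above, assuming the table has a numeric index column to partition on; the table name, the index column id, the connection URI and the partition count are made-up placeholders, and the exact signature of read_sql_table differs between dask versions:

```python
# Minimal sketch of the dask approach. Table name, index column "id",
# connection URI and partition count are made-up placeholders.
import dask.dataframe as dd

uri = "postgresql://user:password@localhost:5432/mydb"

# The table is read lazily and split into partitions along the index column,
# so only one partition has to fit in memory at a time.
ddf = dd.read_sql_table("matrix_table", uri, index_col="id", npartitions=20)

# Operations build a task graph; nothing is loaded until .compute() is called.
col_means = ddf.mean().compute()

# Each partition is an ordinary pandas DataFrame, so blocks can still be
# handed to numpy one at a time.
first_block = ddf.get_partition(0).compute().to_numpy()
```

Since each partition is a plain pandas DataFrame, existing numpy code can still be applied block by block rather than to the whole matrix at once.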

0 Answers