
My tasks:

  1. load matrices whose dimensions are bigger than my RAM from the database using pandas.read_sql(...) (the database is PostgreSQL)
  2. operate on the numpy representation of those larger-than-RAM matrices using numpy

The problem: I get a memory error even just loading the data from the database.

My temporary quick-and-dirty solution: loop over chunks of the data (importing only part of it at a time) so that RAM can handle the workload. The issue is speed: runtime is significantly higher, and before delving into Cython optimization and the like, I wanted to know whether there are solutions (either in the form of data structures, such as the shelve library, or the HDF5 format) that would address this.
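For concreteness, here is a minimal sketch of that chunked approach combined with an on-disk HDF5 store (pandas' HDFStore, which needs PyTables); the connection string, the table name matrix_table and the chunk size are made-up placeholders:

```python
# Minimal sketch: stream the query result in chunks and spill it to HDF5.
# Connection string, table name ("matrix_table") and chunk size are
# made-up placeholders; pandas' HDFStore requires the PyTables package.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")
chunk_size = 50000  # rows per chunk; tune so one chunk comfortably fits in RAM

with pd.HDFStore("matrix.h5", mode="w") as store:
    # chunksize makes read_sql return an iterator of DataFrames instead of
    # materialising the whole result set at once
    for chunk in pd.read_sql("SELECT * FROM matrix_table", engine, chunksize=chunk_size):
        store.append("matrix", chunk)  # append each chunk to the on-disk table

# later, slices can be read back without loading everything into memory
with pd.HDFStore("matrix.h5", mode="r") as store:
    block = store.select("matrix", start=0, stop=chunk_size)
    arr = block.to_numpy()  # hand one block at a time to numpy
```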

Asher11
  • Would you like to explore [dask](http://dask.pydata.org/en/latest/)? A Dask DataFrame is a large parallel dataframe composed of many smaller pandas dataframes, split along the index. These pandas dataframes may live on disk, which allows larger-than-memory work. This could work for you. – Zero Dec 23 '16 at 17:07
  • I am open to exploring anything that works. By the way, super quick answer :) – Asher11 Dec 23 '16 at 17:07
  • 1
    indeed. I am going through the docs and tutorials now. I see the library is though a much smaller version than pandas (understandably since it's huge). but I also see that with some tricks (https://github.com/dask/dask/issues/943) and (http://stackoverflow.com/questions/31361721/python-dask-dataframe-support-for-trivially-parallelizable-row-apply) I might be able to get the job done. I'll still need to understand how to best wrap all of this but this definetely looks promising. thank you – Asher11 Dec 23 '16 at 17:18
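For reference, a rough sketch of the dask route suggested in the comments above, assuming the table has a numeric index column to partition on; the table name, the index column id, the connection URI and the partition count are made-up placeholders, and the exact signature of read_sql_table differs between dask versions:

```python
# Minimal sketch of the dask approach. Table name, index column "id",
# connection URI and partition count are made-up placeholders.
import dask.dataframe as dd

uri = "postgresql://user:password@localhost:5432/mydb"

# The table is read lazily and split into partitions along the index column,
# so only one partition has to fit in memory at a time.
ddf = dd.read_sql_table("matrix_table", uri, index_col="id", npartitions=20)

# Operations build a task graph; nothing is loaded until .compute() is called.
col_means = ddf.mean().compute()

# Each partition is an ordinary pandas DataFrame, so blocks can still be
# handed to numpy one at a time.
first_block = ddf.get_partition(0).compute().to_numpy()
```

Since each partition is a plain pandas DataFrame, existing numpy code can still be applied block by block rather than to the whole matrix at once.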

0 Answers