I have a large file I need to load into a dataframe. I will need to work on it for a while. Is there a way of keeping it loaded in memory, so that if my script fails, I will not need to load it again?
- Maybe you can [pickle](http://docs.python.org/2/library/pickle.html) it using [`to_pickle`](http://pandas.pydata.org/pandas-docs/stable/io.html#pickling) – jezrael Jan 14 '16 at 08:28
- And maybe [this](http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/) helps. – jezrael Jan 14 '16 at 09:16
- Thanks! How about other data structures, like numpy matrices or objects? – matlabit Jan 20 '16 at 07:37
- numpy is easy: `pd.DataFrame(numpyarray)` – jezrael Jan 20 '16 at 07:38
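Following up on the pickle suggestion in the comments, a minimal round-trip sketch (the file name df.pkl is an arbitrary example, not from the original):

import pandas as pd

df = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
df.to_pickle('df.pkl')         # serialize to disk; cheap to reload after a crash
df = pd.read_pickle('df.pkl')  # restore without re-parsing the original large file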
1 Answer
Here's an example of how one can keep a DataFrame available between runs. For persistent storage beyond RAM, I would recommend looking into HDF5: it's fast, simple, and allows for queries if necessary (see the docs). pandas provides read_hdf() and to_hdf(), analogous to the _csv() methods, but significantly faster.
A simple illustration of storage and retrieval including query (from the docs) would be:
import pandas as pd

df = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))
df.to_hdf('store_tl.h5', 'table', append=True)          # table format supports appends and queries (needs PyTables)
pd.read_hdf('store_tl.h5', 'table', where=['index>2'])  # query: return only rows with index > 2
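Applied to the original question, the pattern is to cache the parsed DataFrame on disk so that a crashed script can reload it quickly on the next run. A rough sketch under assumed names: cache.h5 is a hypothetical cache path and big_file.csv stands in for the expensive source file.

import os
import pandas as pd

CACHE = 'cache.h5'  # hypothetical cache path

if os.path.exists(CACHE):
    df = pd.read_hdf(CACHE, 'df')     # a previous run already paid the parsing cost
else:
    df = pd.read_csv('big_file.csv')  # stand-in for the slow initial load
    df.to_hdf(CACHE, 'df')            # cache it for the next run

Reading the HDF5 cache back is typically much faster than re-parsing a large text file, which is the main win when a script has to restart.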