
I'm trying to apply machine learning (Python with scikit-learn) to a large dataset stored in a CSV file that is about 2.2 gigabytes.

As this is a partially empirical process, I need to run the script numerous times, which means pandas.read_csv() gets called over and over again, and that takes a lot of time.

Obviously this is very time-consuming, so I guess there must be a way to make reading the data faster, such as storing it in a different format or caching it in some way.

A code example in the solution would be great!

  • I think you need [`hdf5`](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) – jezrael Nov 08 '16 at 08:02
  • You could indeed try storing the data in a different format, such as [bcolz](http://bcolz.blosc.org/en/latest/intro.html). However, you might also want to consider changing your process. For instance, you could try to make your script do more of the "empirical process" in a single run, or you could work with a subset of your data for a while before trying the process on the whole dataset. – BrenBarn Nov 08 '16 at 08:03
  • What kind of data (which dtypes) are you going to store? Is it only numerical data, or are there also `datetime` and/or strings, categories, etc.? – MaxU - stand with Ukraine Nov 08 '16 at 08:24
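
As BrenBarn's comment suggests, one quick way to shorten the iteration loop is to read only a slice of the file while prototyping. A minimal sketch, assuming a hypothetical file name `data.csv` and that the first 100,000 rows are representative enough for experimentation:

```python
import pandas as pd

# Read only the first 100,000 rows while iterating on the script;
# switch back to the full file once the pipeline has stabilised.
df_sample = pd.read_csv("data.csv", nrows=100000)
```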

1 Answer


I would store the already parsed DataFrames in one of pandas' binary formats (HDF5, for example, as the comments suggest).

All of them are very fast to read back compared with parsing the CSV again.

P.S. It's important to know what kind of data (which dtypes) you are going to store, because it can affect the speed dramatically.
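
For instance, here is a minimal sketch of the caching idea using HDF5 (which the comments above also point to); it assumes hypothetical file paths `data.csv` and `data.h5`, and requires the PyTables package for pandas' HDF5 support:

```python
import os

import pandas as pd

CSV_PATH = "data.csv"    # hypothetical path to the original 2.2 GB CSV
CACHE_PATH = "data.h5"   # hypothetical path for the binary cache

def load_data():
    """Parse the CSV once, then serve every later run from the HDF5 cache."""
    if os.path.exists(CACHE_PATH):
        return pd.read_hdf(CACHE_PATH, key="df")
    df = pd.read_csv(CSV_PATH)
    df.to_hdf(CACHE_PATH, key="df", mode="w")
    return df

df = load_data()
```

On every run after the first, the script skips `read_csv()` entirely and pays only the much cheaper binary read. The same pattern works with `DataFrame.to_pickle()` / `pandas.read_pickle()` if you prefer to avoid the PyTables dependency.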
