
I'm trying to apply machine learning (Python with scikit-learn) to a large dataset stored in a CSV file that is about 2.2 gigabytes.

As this is a partially empirical process, I need to run the script numerous times, which means pandas.read_csv() gets called over and over again, and that takes a lot of time.

Obviously this is very time-consuming, so I guess there must be a way to make reading the data faster, such as storing it in a different format or caching it in some way.

A code example in the solution would be great!

  • I think you need [`hdf5`](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) – jezrael Nov 08 '16 at 08:02
  • You could indeed try storing the data in a different format, such as [bcolz](http://bcolz.blosc.org/en/latest/intro.html). However, you might also want to consider changing your process. For instance, you could try to make your script do more of the "empirical process" in a single run, or you could work with a subset of your data for a while before trying the process on the whole dataset. – BrenBarn Nov 08 '16 at 08:03
  • What kind of data (which dtypes) are you going to store? Is it only numerical data, or are there also `datetime` and/or strings, categories, etc.? – MaxU - stand with Ukraine Nov 08 '16 at 08:24
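
As BrenBarn's comment suggests, one quick way to shorten the iteration loop is to read only a slice of the file while prototyping. A minimal sketch, assuming a hypothetical file name `data.csv` and that the first 100,000 rows are representative enough for experimentation:

```python
import pandas as pd

# Read only the first 100,000 rows while iterating on the script;
# switch back to the full file once the pipeline has stabilised.
df_sample = pd.read_csv("data.csv", nrows=100000)
```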

1 Answer


I would store the already parsed DataFrames in one of pandas' binary formats (HDF5, for example, as the comments suggest).

All of them are very fast to read back compared with parsing the CSV again.

P.S. It's important to know what kind of data (which dtypes) you are going to store, because it can affect the speed dramatically.
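
For instance, here is a minimal sketch of the caching idea using HDF5 (which the comments above also point to); it assumes hypothetical file paths `data.csv` and `data.h5`, and requires the PyTables package for pandas' HDF5 support:

```python
import os

import pandas as pd

CSV_PATH = "data.csv"    # hypothetical path to the original 2.2 GB CSV
CACHE_PATH = "data.h5"   # hypothetical path for the binary cache

def load_data():
    """Parse the CSV once, then serve every later run from the HDF5 cache."""
    if os.path.exists(CACHE_PATH):
        return pd.read_hdf(CACHE_PATH, key="df")
    df = pd.read_csv(CSV_PATH)
    df.to_hdf(CACHE_PATH, key="df", mode="w")
    return df

df = load_data()
```

On every run after the first, the script skips `read_csv()` entirely and pays only the much cheaper binary read. The same pattern works with `DataFrame.to_pickle()` / `pandas.read_pickle()` if you prefer to avoid the PyTables dependency.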
