
Given a 1.5 GB list of pandas DataFrames, which format is fastest for loading compressed data: pickle (via cPickle), HDF5, or something else in Python?

  • I only care about fastest speed to load the data into memory
  • I don't care about dumping the data, it's slow but I only do this once.
  • I don't care about file size on disk
Jesper - jtk.eth
  • Have you tried measuring this, in your specific conditions? – pvg Jun 20 '16 at 17:52
  • I'm guessing that pickle will be one of the worst ways to dump this data :-). Of course, that's just a guess. I don't have any hard data to back it up. Speaking of hard data, why not do an experiment and find out? – mgilson Jun 20 '16 at 17:52
  • Well I am posting a question not an answer I guess ;) I can test pickle vs hdf5, but what if an expert on this site can direct me to a much better method that falls under "or something else"? :) – Jesper - jtk.eth Jun 20 '16 at 17:54
  • You may want to check this [comparison](http://stackoverflow.com/a/37012035/5741205)... – MaxU - stand with Ukraine Jun 20 '16 at 17:56
  • **you can profile this yourself** and if you are fishing for recommendations as your comment suggests, then that is **explicitly off-topic**: *Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.* – Jun 20 '16 at 18:01
  • This has been asked before and the consensus every time is pretty much "it completely depends on your specific data." If one was wholly better then the other then you would have found the answer before posting your question. – Tadhg McDonald-Jensen Jun 20 '16 at 18:09
  • @TadhgMcDonald-Jensen "If one was wholly better then the other then you would have found the answer before posting your question." is just generally a strange logic. – Jesper - jtk.eth Jun 20 '16 at 18:30
  • I just meant that if there was a definitive answer it wouldn't have been closed as off topic and you would have found it on SO. – Tadhg McDonald-Jensen Jun 20 '16 at 18:32
  • Was this [pandas version 0.18.1](https://github.com/pandas-dev/pandas/releases?after=v0.20.0)? Everyone, please include pandas version no; performance comparisons without version no are near-useless on something as fast-changing as pandas. – smci Nov 15 '19 at 00:18

1 Answer


UPDATE: nowadays I would choose between Parquet, Feather (Apache Arrow), HDF5 and Pickle.

Pros and cons (a short pandas read/write sketch for each format follows this list):

  • Parquet
    • pros
      • one of the fastest and widely supported binary storage formats
      • supports very fast compression methods (for example Snappy codec)
      • de-facto standard storage format for Data Lakes / BigData
    • cons
      • the whole dataset must be read into memory; you can't read a smaller subset of rows. One way to overcome this is to use partitioning and read only the required partitions.
      • no support for indexing; you can't read a specific row or a range of rows - you always have to read the whole Parquet file
      • Parquet files are immutable - you can't change them (no way to append, update or delete); one can only write or overwrite a Parquet file. This "limitation" comes from the Big Data world and would be considered one of the huge "pros" there.
  • HDF5
    • pros
      • supports data slicing - ability to read a portion of the whole dataset (we can work with datasets that wouldn't fit completely into RAM).
      • relatively fast binary storage format
      • supports compression (though the compression is slower compared to the Snappy codec used by Parquet)
      • supports appending rows (mutable)
    • cons
  • Pickle
    • pros
      • very fast
    • cons
      • takes up a lot of disk space
      • for long-term storage one might run into compatibility problems; you might need to specify the pickle version when reading old pickle files
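For reference, here is a minimal sketch of how each of these formats can be written and read through pandas (the file names, the HDF5 key and the column names are placeholders of my own, not from the question; Parquet and Feather need `pyarrow` or `fastparquet`, HDF5 needs PyTables):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real data (names/sizes are made up for illustration)
df = pd.DataFrame({
    "ts": pd.date_range("2016-01-01", periods=100_000, freq="s"),
    "value": np.random.randn(100_000),
})

# Parquet (needs pyarrow or fastparquet); Snappy is the usual codec
df.to_parquet("data.parquet", compression="snappy")
cols = pd.read_parquet("data.parquet", columns=["value"])     # read a column subset

# Feather / Apache Arrow (needs pyarrow)
df.to_feather("data.feather")
back = pd.read_feather("data.feather")

# HDF5 (needs PyTables); format="table" allows appending, row slicing and queries
df.to_hdf("data.h5", key="df", mode="w", format="table",
          complib="zlib", complevel=5, data_columns=["value"])
rows = pd.read_hdf("data.h5", "df", start=0, stop=10_000)     # read a row slice
hits = pd.read_hdf("data.h5", "df", where="value > 3")        # query an indexed data column

# Pickle (compression inferred from the file extension)
df.to_pickle("data.pkl.gz")
same = pd.read_pickle("data.pkl.gz")
```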

OLD Answer:

I would consider only two storage formats: HDF5 (PyTables) and Feather

Here are the results of my read/write comparison for a DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).

Comparison of the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (various compression options):

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

But it might be different for you, because all my data was of the datetime dtype, so it's always better to run such a comparison on your real data, or at least on similar data...
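If you want to repeat such a comparison on your own data, here is a minimal sketch of a timing harness along these lines (not the exact script behind the table above; the generated frame, file names and repeat count are placeholders):

```python
import time

import numpy as np
import pandas as pd

# stand-in frame; swap in your real data for meaningful numbers
df = pd.DataFrame(np.random.randn(1_000_000, 6), columns=list("abcdef"))

def bench(label, write, read, repeats=3):
    """Time one write and the average of several reads for a given format."""
    t0 = time.perf_counter()
    write()
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(repeats):
        read()
    read_s = (time.perf_counter() - t0) / repeats

    print(f"{label:<18} read {read_s:8.3f}s   write {write_s:8.3f}s")

bench("CSV",
      lambda: df.to_csv("bench.csv", index=False),
      lambda: pd.read_csv("bench.csv"))
bench("Pickle",
      lambda: df.to_pickle("bench.pkl"),
      lambda: pd.read_pickle("bench.pkl"))
bench("HDF_fixed",
      lambda: df.to_hdf("bench_fixed.h5", key="df", mode="w", format="fixed"),
      lambda: pd.read_hdf("bench_fixed.h5", "df"))
bench("HDF_tab_zlib_c5",
      lambda: df.to_hdf("bench_tab.h5", key="df", mode="w", format="table",
                        complib="zlib", complevel=5),
      lambda: pd.read_hdf("bench_tab.h5", "df"))
```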

MaxU - stand with Ukraine
  • Why do you only consider HDF5 and Feather, but not Pickle? Your result shows that it's quite good, and there is also compressed pickle. Isn't it a good standard choice? – THN Jan 31 '18 at 03:31
  • @THN, If I recall correctly I saw some bugs in the past - I'm not sure though whether it's still the case... – MaxU - stand with Ukraine Jan 31 '18 at 21:29
  • Upvoted! What if multiple users had to read the file in parallel while writes were being done by a single process? – PirateApp Jun 27 '18 at 04:36
  • @PirateApp, multiple readers shouldn't be a problem per se (IO might suffer, of course). I don't know what will happen if single/multiple readers try to read data that is being written at the same time. It should be thoroughly tested. I would consider using one of the RDBMS (Oracle, MySQL, PostgreSQL, etc.) or Hive, Spark, etc. for multi-user environments. – MaxU - stand with Ukraine Jun 27 '18 at 07:50
  • Upvoted! Thanks for sharing, MaxU. The reason I was asking is that I have groups of OHLC data per ticker and I don't want to load it all at once into memory, since it takes a significant amount; I want to do a full table scan iterating over each group. I was curious whether HDF5 would be suitable for that vs SQLite or other products. It seems a bit weird to me, since full table scans in SQL go against the principles of RDBMS design I guess, but multi-user read is a must. – PirateApp Jun 27 '18 at 08:25
  • @PirateApp, you might want to read [this docs](https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html) – MaxU - stand with Ukraine Jun 27 '18 at 08:30
  • @MaxU is HDF5 a longterm storage thing? I understand that regular pickles are not, but it would be great if pandas itself had a long-term storage solution. – MetaStack Jul 06 '19 at 22:14
  • @LegitStack, currently I would use either HDF5 or Parquet format - both of them are: 1) binary format 2) support compression 3) longterm storage 4) very fast compared to other formats – MaxU - stand with Ukraine Jul 06 '19 at 22:16
  • @PirateApp the `h5py` package describes your use case [here](http://docs.h5py.org/en/latest/swmr.html); they call it Single Writer Multiple Reader (SWMR). – Kyle Aug 09 '19 at 18:30
  • @AbdealiJK: Nonsense! [pickle protocol 4+ in Python 3.4+ can handle objects larger than 4GB](https://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb). I'm successfully using 3GB pickle.gzips without problems. Your information seems to be out of date (old Python version? protocol <=3?). Can you please delete the comment and add an updated one with up-to-date information? – smci Nov 19 '19 at 01:29
  • Reading a CSV file row by row is much faster than the pandas.read_csv() function. I can read 11000 lines in less than 0.15 seconds, but pandas' function takes around 7s for the same file. It doesn't make sense why pandas is in this benchmark. – LeanMan Oct 29 '20 at 03:19
  • @LeanMan, did you read the original question? It's all about Pandas, so I don't understand your comment... And one more thing: where do you store the data when you read it row by row, and what are you doing with this data afterwards? – MaxU - stand with Ukraine Oct 29 '20 at 09:08
  • @MaxU, I was reading the answer. It uses Pandas.read_csv(). The file is 2 MB, 11000 lines, 20 columns with float data points represented as ASCII text, CSV. In each benchmark scenario, the chosen method needs to load the file, then calculate a moving average and compare it against the test data that is in one of the columns. So maybe not the best test, but it follows a typical scenario for my use case, which is: load a file, do some work, verify that the work done was correct. Doing a file open() and a csv reader is so much faster than these other solutions. I don't know what I'm missing. – LeanMan Oct 29 '20 at 15:43
  • Meaning, what am I missing to make HDF5 work for me? Oh, and regarding the CSV file, it's stored on disk and loaded line by line or chunked in, depending on the method chosen. The file format and the way the test loads and processes the file are entirely based on the method under test. The apples-to-apples comparison is done on whether or not the method achieved the desired behavior, regardless of how I implemented it underneath. Hopefully it's clear that my intention is to validate these other methods. – LeanMan Oct 29 '20 at 15:44
  • @LeanMan, it doesn't make sense to compare reading CSV files using `csv.reader()` with `pd.read_csv()`, as `csv.reader` gives you a list of lists of __strings__, which must first be parsed into the correct data types; then you need to create a Pandas DataFrame out of that list. If you measure the whole procedure that you are doing manually, I doubt your code will be faster compared to `pd.read_csv()`. So please compare apples to apples! ;) – MaxU - stand with Ukraine Oct 29 '20 at 17:01
  • I think apples to apples is not implementation level but rather behavior. If both codes do what I want and one is faster, I think the comparison is legit. – LeanMan Oct 29 '20 at 17:55
  • @LeanMan, then I don't really understand why you posted your first comment at all. How does it help answer the original question about Pandas DataFrames, and how will it help those who come to this page from search engines? – MaxU - stand with Ukraine Oct 29 '20 at 18:24
  • Because I am one of those people coming in behind trying to make sense of your work. I mean, it's a comment, not an answer to the OP. My comment is a question, not an answer. My question is: if HDF5 is supposed to be a solution in regards to storage and IO access benefits, then what am I not doing right with HDF5 to outperform a simple CSV line-by-line read? I mean, we can move it to a chat if that is more appropriate - as SO suggests. – LeanMan Oct 29 '20 at 18:48