
I have more than 500 PyTables stores that contain about 300 MB of data each. I would like to merge these files into one big store using pandas append, as in the code below.

from pandas import HDFStore

def merge_hdfs(file_list, merged_store):
    # Append the 'data' table of each source store to the merged store.
    for file in file_list:
        store = HDFStore(file, mode='r')
        merged_store.append('data', store.data)
        store.close()

The append operation is very slow (it takes up to 10 minutes to append a single store to merged_store), and strangely the file size of merged_store seems to grow by about 1 GB for each appended store.

I have specified the total number of expected rows, which according to the documentation should improve performance, and having read Improve pandas (PyTables?) HDF5 table write performance I was expecting long write times, but almost 10 minutes for every 300 MB seems too slow, and I cannot understand why the file size increases so much.
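
For reference, this is roughly how I pass the expected row count (just a sketch; total_rows is a placeholder for the sum of rows over all 500 files):

# expectedrows passes the expected total row count down to PyTables
total_rows = 500 * 272734  # placeholder estimate
merged_store.append('data', store.data, expectedrows=total_rows)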

I wonder if I am missing something?

For additional information, here is a description of one of the 500 PyTables tables.

/data/table (Table(272734,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(6,), dflt=0.0, pos=1),
  "id": StringCol(itemsize=11, shape=(), dflt='', pos=2),
  "datetaken": Int64Col(shape=(), dflt=0, pos=3),
  "owner": StringCol(itemsize=15, shape=(), dflt='', pos=4),
  "machine_tags": StringCol(itemsize=100, shape=(), dflt='', pos=5),
  "title": StringCol(itemsize=200, shape=(), dflt='', pos=6),
  "country": StringCol(itemsize=3, shape=(), dflt='', pos=7),
  "place_id": StringCol(itemsize=18, shape=(), dflt='', pos=8),
  "url_s": StringCol(itemsize=80, shape=(), dflt='', pos=9),
  "url_o": StringCol(itemsize=80, shape=(), dflt='', pos=10),
  "ownername": StringCol(itemsize=50, shape=(), dflt='', pos=11),
  "tags": StringCol(itemsize=505, shape=(), dflt='', pos=12)}
  byteorder := 'little'
  chunkshape := (232,)

1 Answer


This is basically the same as the answer here, which I posted recently.

Bottom line is this: you need to turn off indexing when creating the store, store.append('df', df, index=False), then index it at the end.

Furthermore, turn off compression when merging the tables as well.

Indexing is a fairly expensive operation and, if I recall correctly, only uses a single processor.

Finally, make sure that you create the merged store with mode='w', since all of the subsequent operations are appends and you want to start with a clean new file.
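
Putting those points together, a minimal sketch of the merge might look like this (the file name is a placeholder, and the 'data' key is taken from your code):

from pandas import HDFStore

# start with a clean new file, append without indexing or compression,
# then build the index once at the end
merged_store = HDFStore('merged.h5', mode='w')  # placeholder file name

for file in file_list:
    store = HDFStore(file, mode='r')
    merged_store.append('data', store.data, index=False)
    store.close()

# index the table in a single pass after all of the appends are done
merged_store.create_table_index('data', optlevel=9, kind='full')
merged_store.close()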

I also would NOT specify the chunksize upfront. Rather, after you have created the final index, perform the compression using ptrepack and specify chunkshape=auto, which will compute it for you. I don't think this will affect write performance but will optimize query performance.
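
As a sketch, the repack step could be driven from Python like this (the file names are placeholders; ptrepack is the command-line tool that ships with PyTables):

import subprocess

# repack the merged file with an auto-computed chunkshape and blosc compression
subprocess.run([
    'ptrepack',
    '--chunkshape=auto',
    '--propindexes',      # carry the indexes over to the repacked file
    '--complevel=9',
    '--complib=blosc',
    'merged.h5',          # placeholder source file
    'merged_packed.h5',   # placeholder destination file
], check=True)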

You might also try setting the chunksize parameter to append (this is the writing chunksize) to a larger number.

Obviously, make sure that each of the tables being appended has exactly the same structure (append will raise if this is not the case).
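
One way to keep the string columns consistent across the files is to pass the same min_itemsize mapping on every append; a sketch, with the sizes mirroring the itemsize values in the table description above:

# pin the string column widths so every appended table matches the merged layout
min_itemsize = {
    'id': 11, 'owner': 15, 'machine_tags': 100, 'title': 200,
    'country': 3, 'place_id': 18, 'url_s': 80, 'url_o': 80,
    'ownername': 50, 'tags': 505,
}
merged_store.append('data', store.data, index=False, min_itemsize=min_itemsize)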

I created this issue for an enhancement to do this 'internally': https://github.com/pydata/pandas/issues/6837

  • Thank you, removing the indexing is speeding up the process substantially. However, I still get very large file sizes: for every 300 MB table that I append to the merged store, I get an increase in size of 1 GB, which will eventually fill up my disk. This should not be due to compression, as I did not apply compression to the 300 MB files. – danieleb Apr 08 '14 at 12:54
  • are the tables EXACTLY the same? When you concat them, the strings will all be of the resulting table's string sizes (which could be bigger, depending on what the minimum is) – Jeff Apr 08 '14 at 12:56
  • I have parsed the data so that the size of long strings in every table is restricted to fixed values (to avoid errors in the `append` operation). The tables are not exactly the same, and their size also varies slightly, but I have kept the same `min_itemsize` parameters for the merged table, and so I would not expect strings to change size... – danieleb Apr 08 '14 at 13:01
  • are you sure that ``min_itemsize`` is specified for EVERY field? (e.g. do a ``ptdump -av`` comparison for an individual file and the merged one). This can make a huge difference. Say a field is NOT specified with a ``min_itemsize`` in the individual file but IS in the merged file: you can easily double the size (e.g. say it's 10 in the individual file, but 100 in the merged). Compression should help a lot with this though, so compare the compressed files as well. Seeing that your dataset is pretty large, I would certainly compress it using ``blosc`` as it will make queries FASTER as well. – Jeff Apr 08 '14 at 13:12
  • Because all the tables were generated with the same `min_itemsize` parameters (and contained strings of fixed maximum length), I assumed that the fields of the merged table would automatically be created with the right size without explicitly passing the `min_itemsize` parameter to the append function. I have now done this and everything seems to be ok. I will follow your suggestion about the compression, thank you again for your help! – danieleb Apr 08 '14 at 13:35
  • gr8. post here with some stats! – Jeff Apr 08 '14 at 13:39
  • another suggestion: if your data (say tags, a big string field) is somewhat sparse, in that it is populated on certain rows but not others, it *might* pay to keep it in a separate table (with an index key), then effectively join it when selecting. Alternatively, it might pay to store these in another table as a set of sub-tags (presupposing that these are a bunch of sub-fields), and index into the main table. All that said, disk is cheap :) relative to your time. – Jeff Apr 08 '14 at 16:20
  • the index flag doesn't seem to appear in the documentation? https://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html – Parsa Oct 01 '18 at 13:23