
I want to store a DataFrame with columns of different types in an HDF5 file (an excerpt of the dtypes is shown below).

In  [1]: mydf
Out [1]:
endTime             uint32
distance           float16
signature         category
anchorName        category
stationList         object

Before converting some columns to category (signature and anchorName in the excerpt above), I used code like the following to store it, which worked fine:

path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', complevel=9, complib='bzip2')

But this does not work with category columns, so I tried the following:

path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', format='t', complevel=9, complib='bzip2')

This works fine if I remove the column stationList, in which each entry is a list of strings. With this column, however, I get the following exception:

Cannot serialize the column [stationList] because
its data contents are [mixed] object dtype

How do I need to change my code to get the data frame stored?

pandas version: 0.17.1
python version: 2.7.6 (cannot change it due to compatibility reasons)


edit1 (some sample code):

import pandas as pd

mydf = pd.DataFrame({'endTime' : pd.Series([1443525810,1443540836,1443609470]),
                    'distance' : pd.Series([454.75,477.25,242.12]),
                    'signature' : pd.Series(['ab','cd','ab']),
                    'anchorName' : pd.Series(['tec','ing','pol']),
                    'stationList' : pd.Series([['t1','t2','t3'],['4','t2','t3'],['t3','t2','t4']])
                    })

# this works fine (no category)
mydf.to_hdf('tmp_without_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

for col in ['anchorName', 'signature']:
    mydf[col] = mydf[col].astype('category')

# this crashes now because of category data
# mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', complevel=9, complib='bzip2')

# switching to format='t'
# this raises the "mixed" object dtype error because of column stationList
mydf.to_hdf('tmp_with_cat.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

mydf.pop('stationList')

# this again works fine
mydf.to_hdf('tmp_with_cat_without_stationList.hdf5', 'journeys', mode='w', format='t', complevel=9, complib='bzip2')

edit2: In the meantime I tried different things to get rid of this problem. One of them was to convert the entries of column stationList to tuples (possible since they are not supposed to change) and to convert that column to category as well. But it did not change anything. Here are the lines I added after the conversion loop (just for completeness):

mydf.stationList = [tuple(x) for x in mydf.stationList.values]
mydf.stationList.astype('category')
AnnetteC
  • Is this a question? Also it would help if you posted actual code to create a test dataframe. – Stop harming Monica Feb 03 '16 at 12:05
  • That is the problem. I get the data out of other files which were stored by some other scripts. I will try to create a basic data generation script which can show my problem. – AnnetteC Feb 03 '16 at 12:19
  • It looks like you cannot store both categorical data and lists/tuples in the same HDF5 format at the moment (this might be fixed in the future). I cannot tell you what to change without knowing more about your requirements. Maybe leave the strings as strings, maybe choose a different representation for stationList items... there are too many options. – Stop harming Monica Feb 03 '16 at 23:17
  • Yes, of course. There are some options. One idea was also to convert the stationList entries to a string with a defined separator. But this would cause a lot of changes to the existing code... I am really new to Python and am only supposed to maintain an existing project. – AnnetteC Feb 04 '16 at 09:40

1 Answer


You have two problems:

  1. You want to store categorical data in an HDF5 file;
  2. You're trying to store arbitrary objects (i.e. stationList) in an HDF5 file.

As you discovered, categorical data is (currently?) only supported in the "table" format for HDF5.

However, storing arbitrary objects (lists of strings, etc.) is really not something that is supported by the HDF5 format itself. Pandas works around that for you by serializing these objects with pickle and then storing the pickle as an arbitrary-length string (which, I think, is not supported by all HDF5 formats). But that will be slow and inefficient, and will never be supported well by HDF5.

In my mind, you have two options:

  1. Pivot your data so that you have one row of data per station name. Then you can store everything in a table-format HDF5 file. (This is a good practice in general; see Hadley Wickham on Tidy Data.)
  2. If you really want to keep this format, then you might as well save the whole dataframe using to_pickle(). This will have no problem dealing with any kind of object (e.g. lists of strings, etc.) you throw at it.
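As a minimal sketch of option 2, the round trip below pickles a small frame modelled on the question's sample data and reads it back. The temporary directory and the file name journeys.pkl are assumptions made for illustration only.

```python
import os
import tempfile

import pandas as pd

# Sample data modelled on the question (shortened to three columns).
mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'signature': ['ab', 'cd', 'ab'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})
mydf['signature'] = mydf['signature'].astype('category')

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'journeys.pkl')

mydf.to_pickle(path)             # stores lists and categoricals as-is
restored = pd.read_pickle(path)  # only load pickles from trusted sources

print(restored.dtypes)
```

The list column and the category dtype both come back unchanged, but the file is a Python pickle rather than HDF5, so it is mainly readable from Python and should only be loaded from sources you trust.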

Personally, I would recommend option 1. You get to use a fast, binary file format. And the pivot will also make other operations with your data easier.
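A rough sketch of the reshaping behind option 1, assuming the sample data from the question. The helper to_long_format and the column names journey, position and station are hypothetical names chosen for this illustration. A plain loop is used so that the code does not depend on newer pandas features.

```python
import pandas as pd

mydf = pd.DataFrame({
    'endTime': [1443525810, 1443540836, 1443609470],
    'distance': [454.75, 477.25, 242.12],
    'signature': ['ab', 'cd', 'ab'],
    'anchorName': ['tec', 'ing', 'pol'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})


def to_long_format(df, list_col='stationList'):
    """Return one row per (journey, station) instead of one list per journey."""
    rows = [
        (journey, position, station)
        for journey, stations in zip(df.index, df[list_col])
        for position, station in enumerate(stations)
    ]
    stations_df = pd.DataFrame(rows, columns=['journey', 'position', 'station'])
    # Attach the scalar columns of the original journey to every station row.
    return stations_df.join(df.drop(list_col, axis=1), on='journey')


long_df = to_long_format(mydf)
for col in ['anchorName', 'signature']:
    long_df[col] = long_df[col].astype('category')

print(long_df)
```

The result contains only scalar columns, so it should be writable with the table-format call from the question, for example long_df.to_hdf('tmp_long.hdf5', key='journeys', mode='w', format='t', complevel=9, complib='bzip2'), assuming PyTables is installed (the file name is again hypothetical, and this has not been checked against pandas 0.17.1). The original lists can be rebuilt by grouping on journey and ordering by position.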

Christian Hudon
  • This problem is very standard, but solutions are few – tensor Jun 11 '17 at 06:30
  • Thanks! This helped a lot understanding what's the deal. Too bad HDF5 does not support array columns. They are quite useful at times. – Ufos Jun 22 '18 at 13:15
  • @Ufos You might also want to have a look at Apache Arrow (https://arrow.apache.org/). It's not an on-disk format in itself, but it has more support for nested data types. – Christian Hudon Jun 22 '18 at 20:06
  • One workaround is to store `stationList` as a string; then it will not complain with the mixed-type error. – Oli Aug 28 '18 at 10:27
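
The string workaround from the last comment could look roughly like the sketch below. It assumes that the separator (here '|', an arbitrary choice) never occurs inside a station name; the column name stationStr is hypothetical.

```python
import pandas as pd

SEP = '|'  # assumption: this character never appears in a station name

mydf = pd.DataFrame({
    'signature': ['ab', 'cd', 'ab'],
    'stationList': [['t1', 't2', 't3'], ['4', 't2', 't3'], ['t3', 't2', 't4']],
})

# Encode: each list of strings becomes a single delimited string.
encoded = mydf.drop('stationList', axis=1)
encoded['stationStr'] = mydf['stationList'].apply(SEP.join)

# Decode after reading the data back: split the string into a list again.
decoded = encoded['stationStr'].apply(lambda s: s.split(SEP))

print(encoded)
```

The encoded frame holds plain strings instead of lists, which is the representation the comment suggests storing. Code that reads the file has to split the column again, which is the kind of change to existing code mentioned in the comments on the question.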