
Answered: It appears that this datatype is not suited for appending arbitrary strings to an HDF5 store.

Background

I work with a script that generates single rows of results and appends them to a file on disk in an iterative fashion. To speed things up, I decided to use HDF5 containers rather than .csv. Benchmarking then revealed that strings slow HDF5 down. I was told this can be mitigated by converting the strings to the categorical dtype.
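For intuition, here is a minimal sketch (my own illustration of why categoricals help, not the benchmark referred to above): a categorical stores each distinct string only once, plus a small integer code per row, so columns with repeated strings shrink considerably.

```python
import pandas as pd

# A column with many repeated strings:
strings = pd.Series(['ABCDE', 'FGHIJ'] * 50000)

as_object = strings.memory_usage(deep=True)
as_category = strings.astype('category').memory_usage(deep=True)

# The categorical representation is far more compact:
print(as_object > as_category)  # True
```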

Issue

I have not been able to append categorical rows containing new categories to HDF5. Also, I don't know how to control the dtype of cat.codes, which AFAIK is possible somehow.

Reproducible example:

1 - Create large dataframe with categorical data

import pandas as pd
import numpy as np
from pandas import HDFStore, DataFrame
import random, string

dummy_data = [''.join(random.sample(string.ascii_uppercase, 5)) for i in range(100000)]
df_big = pd.DataFrame(dummy_data, columns = ['Dummy_Data'])
df_big['Dummy_Data'] = df_big['Dummy_Data'].astype('category')

2 - Create one row to append

df_small = pd.DataFrame(['New_category'], columns = ['Dummy_Data'])
df_small['Dummy_Data'] = df_small['Dummy_Data'].astype('category')

3 - Save (1) to HDF and try to append (2)

df_big.to_hdf('h5_file.h5', 'symbols_dict', format='table',
              data_columns=True, append=False,
              complevel=9, complib='blosc')

df_small.to_hdf('h5_file.h5', 'symbols_dict', format='table',
                data_columns=True, append=True,
                complevel=9, complib='blosc')

This results in the following exception:

ValueError: invalid combinate of [values_axes] on appending data [name->Dummy_Data,cname->Dummy_Data,dtype->int8,kind->integer,shape->(1,)] vs current table [name->Dummy_Data,cname->Dummy_Data,dtype->int32,kind->integer,shape->None]
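The int8 vs. int32 mismatch in this error comes from how pandas chooses the codes dtype: it picks the smallest integer type that can hold all category codes. So df_small, with a single category, gets int8 codes, while df_big, with tens of thousands of categories, gets int32 codes. A minimal sketch of this behaviour:

```python
import pandas as pd

# pandas picks the smallest integer dtype that can hold the codes,
# so the codes dtype depends on the number of categories:
few = pd.Series(['New_category']).astype('category')
many = pd.Series([str(i) for i in range(40000)]).astype('category')

print(few.cat.codes.dtype)   # int8  (fewer than 128 categories)
print(many.cat.codes.dtype)  # int32 (more than 32767 categories)
```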

My fixing attempts

I tried to adjust the dtype of cat.codes:

df_big['Dummy_Data'] = df_big['Dummy_Data'].cat.codes.astype('int32')
df_small['Dummy_Data'] = df_small['Dummy_Data'].cat.codes.astype('int32')

When I do this, the error disappears, but so does the categorical dtype:

df_test = pd.read_hdf('h5_file.h5', key='symbols_dict')
print(df_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100001 entries, 0 to 0       # The appending worked now
Data columns (total 1 columns):
Dummy_Data    100001 non-null int32      # Categorical dtype gone
dtypes: int32(1)                         # I need to change dtype of cat.codes of categorical    
memory usage: 1.1 MB                     # Not of categorical itself

In addition, df_small.info() does not show the dtype of cat.codes in the first place, which makes it difficult to debug. What am I doing wrong?
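As a side note, the codes dtype that .info() hides can be queried directly through the .cat accessor, e.g.:

```python
import pandas as pd

df_small = pd.DataFrame(['New_category'], columns=['Dummy_Data'])
df_small['Dummy_Data'] = df_small['Dummy_Data'].astype('category')

# .info() only reports "category"; the underlying codes dtype
# lives on the .cat accessor:
print(df_small['Dummy_Data'].dtype)            # category
print(df_small['Dummy_Data'].cat.codes.dtype)  # int8
```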

Questions

1. How do I properly change the dtype of cat.codes?
2. How do I properly append categorical data to HDF5 in Python?

sudonym
    See https://stackoverflow.com/a/37054761/4893407 for comments on categoricals in HDF with pandas. – Kyle Jun 13 '18 at 19:27

2 Answers


In case it is helpful for you, I will rewrite the beginning of your code. It works for me.

import pandas as pd
from pandas import HDFStore, DataFrame
import random, string


def create_dummy(nb_iteration):

    dummy_data = [''.join(random.sample(string.ascii_uppercase, 5)) for i in range(nb_iteration)]
    df = pd.DataFrame(dummy_data, columns = ['Dummy_Data'])

    return df

df_small= create_dummy(53)
df_big= create_dummy(100000)

df_big.to_hdf('h5_file.h5', 'symbols_dict', format='table',
              data_columns=True, append=False,
              complevel=9, complib='blosc')

df_small.to_hdf('h5_file.h5', 'symbols_dict', format='table',
                data_columns=True, append=True,
                complevel=9, complib='blosc')

df_test = pd.read_hdf('h5_file.h5', key='symbols_dict')
df_test
Daniel Puiu
Amara BOUDIB
  • This works because the categories in your df_small are not new - add 1 six letter string to df_small and you will see – sudonym Jun 13 '18 at 09:43
  • that being said, df_small is one row only. If you add more than 128 rows, the dtype of cat.codes will change to int32 and it will work - however, the question is about adding new categories (only) – sudonym Jun 13 '18 at 10:03

I am not an expert on this, but as far as I can tell from the h5py module docs, http://docs.h5py.org/en/latest/high/dataset.html , HDF5 supports NumPy datatypes, which do not include any categorical datatype.

Same for PyTables, which is used by Pandas.

The categorical datatype is introduced and used by pandas itself, and is described as follows:

A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R)

So what might be happening is that, every time you want to add a new category, you have to somehow re-read all existing categories from the HDF5 store so that pandas can reindex them?

From the docs in general, however, it appears that this datatype is not suited for appending arbitrary strings to an HDF5 store, unless you are sure that after maybe a couple of additions there will be no new categories.
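A sketch of that re-read-and-merge approach (my addition, not part of the original answer) could use pandas' union_categoricals to combine the stored categories with the new ones before overwriting the store:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical workaround: merge the stored categories with the new
# row's categories, then rewrite the whole store. This only works
# while everything still fits in RAM.
stored = pd.Categorical(['ABCDE', 'FGHIJ'])   # stand-in for data read back from HDF5
new = pd.Categorical(['New_category'])        # the row to append

merged = union_categoricals([stored, new])
print(merged.categories.tolist())  # ['ABCDE', 'FGHIJ', 'New_category']
```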

As an additional note: unless your application demands extremely high performance, storing the data in SQL might be a better option -- SQL has better support for strings, for one thing. For example, while SQLite was found to be slower than HDF5 in some tests, those tests didn't include string processing. Jumping from CSV to HDF5 sounds like jumping from a horse cart to a rocket, but perhaps a car or airplane would work just as well (or better, since it has more options, to stretch the analogy)?

Gnudiff
  • 4,297
  • 1
  • 24
  • 25
  • Actually the readout of all present categories from hdf5 to pandas, followed by subsequent concatenation, followed by overwriting the hdf5 store was a temporary workaround. This comes to its limits of course as soon as the store is bigger than RAM. +1 for the answer. – sudonym Jun 13 '18 at 12:37
  • @sudonym this seems to support the view that all categories in pandas have to be known beforehand. – Gnudiff Jun 13 '18 at 12:41