Answered: It appears that the categorical dtype is not suited for appending arbitrary new strings to an HDFStore.
Background
I work with a script that generates single rows of results and appends them to a file on disk iteratively. To speed things up, I decided to use HDF5 containers rather than .csv. Benchmarking then revealed that strings slow HDF5 down. I was told this can be mitigated by converting the strings to the categorical dtype.
Issue
I have not been able to append categorical rows with new categories to HDF5. Also, I don't know how to control the dtype of cat.codes, which AFAIK can be done somehow.
Reproducible example:
1 - Create large dataframe with categorical data
import pandas as pd
import numpy as np
from pandas import HDFStore, DataFrame
import random, string
dummy_data = [''.join(random.sample(string.ascii_uppercase, 5)) for i in range(100000)]
df_big = pd.DataFrame(dummy_data, columns = ['Dummy_Data'])
df_big['Dummy_Data'] = df_big['Dummy_Data'].astype('category')
2 - Create one row to append
df_small = pd.DataFrame(['New_category'], columns = ['Dummy_Data'])
df_small['Dummy_Data'] = df_small['Dummy_Data'].astype('category')
3 - Save (1) to HDF and try to append (2)
df_big.to_hdf('h5_file.h5', 'symbols_dict', format='table', data_columns=True,
              append=False, complevel=9, complib='blosc')
df_small.to_hdf('h5_file.h5', 'symbols_dict', format='table', data_columns=True,
                append=True, complevel=9, complib='blosc')
This results in the following exception:
ValueError: invalid combinate of [values_axes] on appending data [name->Dummy_Data,cname->Dummy_Data,dtype->int8,kind->integer,shape->(1,)] vs current table [name->Dummy_Data,cname->Dummy_Data,dtype->int32,kind->integer,shape->None]
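The int8 vs int32 mismatch in this message comes from the category codes, not from the strings themselves: pandas picks the smallest signed integer dtype that can hold the number of categories. A minimal sketch of that behaviour, where codes_dtype and the cat_N labels are made up for illustration:

```python
import pandas as pd

def codes_dtype(n_categories):
    """Return the dtype pandas chooses for cat.codes with n distinct categories."""
    values = [f"cat_{i}" for i in range(n_categories)]
    return pd.Series(values).astype("category").cat.codes.dtype

small = codes_dtype(1)        # comparable to the one-row frame
medium = codes_dtype(1_000)
large = codes_dtype(40_000)   # comparable to a frame with many distinct strings

print(small, medium, large)   # int8 int16 int32
```

So the one-row frame gets int8 codes while the large frame gets int32 codes, which is exactly the pair of dtypes the exception complains about.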
My fixing attempts
I tried to adjust the dtype of cat.codes:
df_big['Dummy_Data'] = df_big['Dummy_Data'].cat.codes.astype('int32')
df_small['Dummy_Data'] = df_small['Dummy_Data'].cat.codes.astype('int32')
When I do this, the error disappears, but so does the categorical dtype:
df_test = pd.read_hdf('h5_file.h5', key='symbols_dict')
df_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100001 entries, 0 to 0 # The appending worked now
Data columns (total 1 columns):
Dummy_Data 100001 non-null int32 # Categorical dtype gone
dtypes: int32(1) # I need to change dtype of cat.codes of categorical
memory usage: 1.1 MB # Not of categorical itself
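Losing the dtype is not the only problem with this attempt. Each frame was converted to categorical on its own, so the integer codes in the two frames refer to different category lists. A small sketch with made-up values (not the data from the example above) shows the collision:

```python
import pandas as pd

# Two independently created categoricals, mirroring df_big and df_small.
big = pd.Series(["AAAAA", "BBBBB", "AAAAA"]).astype("category")
small = pd.Series(["New_category"]).astype("category")

# Same cast as in the fixing attempt: keep only the integer codes.
big_codes = big.cat.codes.astype("int32")
small_codes = small.cat.codes.astype("int32")

# Both frames use code 0, but it stands for a different string in each.
label_big = big.cat.categories[big_codes.iloc[0]]
label_small = small.cat.categories[small_codes.iloc[0]]
print(big_codes.iloc[0], label_big)      # 0 AAAAA
print(small_codes.iloc[0], label_small)  # 0 New_category
```

Once only the codes are written to disk, the appended 0 can no longer be told apart from the 0 that was already in the table.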
In addition, df_small.info() does not show the dtype of cat.codes in the first place, which makes it difficult to debug. What am I doing wrong?
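For debugging, the codes dtype can be read directly from the column via .cat.codes.dtype. The sketch below also shows one way to influence it, under the assumption that all labels are known when the dtype is built: give both frames the same pd.CategoricalDtype. The S00000-style labels are placeholders, and the sketch does not test the HDF5 append itself.

```python
import pandas as pd

df_big = pd.DataFrame({"Dummy_Data": [f"S{i:05d}" for i in range(40_000)]})
df_small = pd.DataFrame({"Dummy_Data": ["New_category"]})

# Build one category list covering both frames (assumes all labels are known).
all_labels = pd.concat([df_big["Dummy_Data"], df_small["Dummy_Data"]]).unique()
shared = pd.CategoricalDtype(categories=all_labels)

df_big["Dummy_Data"] = df_big["Dummy_Data"].astype(shared)
df_small["Dummy_Data"] = df_small["Dummy_Data"].astype(shared)

# info() only reports "category"; the codes dtype has to be inspected explicitly.
big_codes_dtype = df_big["Dummy_Data"].cat.codes.dtype
small_codes_dtype = df_small["Dummy_Data"].cat.codes.dtype
print(big_codes_dtype, small_codes_dtype)  # int32 int32
```

With a shared category list the codes dtype is the same in both frames, because it depends on the number of categories rather than on the number of rows. This does not help when genuinely new strings show up later, which is the case the question is about.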
Questions
1. How do I properly change the dtype of cat.codes?
2. How do I properly append categorical data to HDF5 in Python?