
I have a pandas DataFrame myDF with a few string columns (whose dtype is object) and many numeric columns. I tried the following:

import pandas

d = pandas.HDFStore("C:\\PF\\Temp.h5")
d['test'] = myDF

I got this result:

C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\io\pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] 
[items->[0, 1, 3, 4, 5, 6, 9, 10, 292, ...]]

  warnings.warn(ws, PerformanceWarning)

It seems like the issue occurs for every column that is a string. For example, if I try

myDF[0].dtype

I get

Out[38]: dtype('O')

How can I fix the issue, i.e., change the dtype of the string columns so that HDFStore can treat them as string columns?


EDIT

More info, as requested:

>>> pandas.__version__
Out[49]: '0.13.1'

>>> tables.__version__
Out[53]: '3.1.0'

The DataFrame is constructed as follows:

pandas.read_csv(fName, sep="|", header=None, low_memory=False)

When I try

myDF.info()

I get

Int64Index: 153895 entries, 0 to 153894
Data columns (total 644 columns):
0      object
1      object
2      int64
3      object
4      object
5      object
6      object
7      int64
8      float64
9      object
10     object
11     float64
12     float64
...
...
642    float64
643    float64
dtypes: float64(619), int64(2), object(23)

All string columns have been read as object.

  • can you show the pandas version, pytables version, OS, df.info(), how you constructed the df, and a sample – Jeff Apr 10 '14 at 21:04
  • why are you passing ``low_memory``? do you have unicode in any strings? – Jeff Apr 10 '14 at 21:11
  • because the file is too large, and without `low_memory` it doesn't seem to work. Here is the error `C:\PF\WinPython-64bit-3.3.3.3\python-3.3.3.amd64\lib\site-packages\pandas\io\parsers.py:1070: DtypeWarning: Columns (6,292,479,572,581,590,599,608,617,626,635) have mixed types. Specify dtype option on import or set low_memory=False. data = self._reader.read(nrows)` – uday Apr 10 '14 at 21:12
  • ok, are you on 32-bit? read in by chunks, and create a ``table`` store instead. – Jeff Apr 10 '14 at 21:13
  • no, I am on 64-bit. see error above. Also, what do you mean by a `table` store? – uday Apr 10 '14 at 21:13
  • you are creating a ``fixed`` store, see here: http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables – Jeff Apr 10 '14 at 21:14
  • but your problem is not really that; it's the mixed dtypes in a column. read in by chunks then either append to a list and concat, or append as you go to a ``table`` store. mixed types in a column are really bad – Jeff Apr 10 '14 at 21:15
  • http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk – Jeff Apr 10 '14 at 21:16
  • http://stackoverflow.com/questions/20428355/appending-column-to-frame-of-hdf-file-in-pandas/20428786#20428786 – Jeff Apr 10 '14 at 21:17
  • don't use the ``low_memory`` flag; it's not documented because it allows columns to have mixed dtypes, and you never want that. – Jeff Apr 10 '14 at 21:18
  • thanks, the `fixed` store option works. is there a way to convert any column with apparently mixed types to be treated as strings when using `read_csv`? – uday Apr 10 '14 at 21:21
  • I updated the answer; you can do it after you read it in – Jeff Apr 10 '14 at 21:25
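
For reference, here is a minimal sketch (not from the thread) of the chunked-read plus ``table``-store approach Jeff suggests above. `fName` and the "|" separator come from the question; `str_cols` and the chunk size are hypothetical and should be adjusted to the actual data:

import pandas

# Open the store for writing; append() below uses the appendable 'table' format.
store = pandas.HDFStore("C:\\PF\\Temp.h5", mode="w")

str_cols = [0, 1, 3, 4, 5, 6, 9, 10]  # hypothetical subset of the string columns

for chunk in pandas.read_csv(fName, sep="|", header=None, chunksize=50000):
    # Force the known string columns to str so no chunk ends up with a
    # mixed-type object column (the cause of the PerformanceWarning).
    chunk[str_cols] = chunk[str_cols].applymap(str)
    # Pass min_itemsize to append() if later chunks may hold longer strings
    # than the first chunk.
    store.append("test", chunk)

store.close()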

1 Answer


This warning ONLY happens if you have mixed types IN a column: not just strings, but strings AND numbers.

In [2]: DataFrame({ 'A' : [1.0,'foo'] }).to_hdf('test.h5','df',mode='w')
pandas/io/pytables.py:2439: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['A']]

  warnings.warn(ws, PerformanceWarning)

In [3]: df = DataFrame({ 'A' : [1.0,'foo'] })

In [4]: df
Out[4]: 
     A
0    1
1  foo

[2 rows x 1 columns]

In [5]: df.dtypes
Out[5]: 
A    object
dtype: object

In [6]: df['A']
Out[6]: 
0      1
1    foo
Name: A, dtype: object

In [7]: df['A'].values
Out[7]: array([1.0, 'foo'], dtype=object)

So you need to ensure that you don't mix types WITHIN a column.

If you have columns that need conversion, you can do this:

In [9]: columns = ['A']

In [10]: df.loc[:,columns] = df[columns].applymap(str)

In [11]: df
Out[11]: 
     A
0  1.0
1  foo

[2 rows x 1 columns]

In [12]: df['A'].values
Out[12]: array(['1.0', 'foo'], dtype=object)
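
Not part of the answer itself, but one way to locate the offending columns in a wide frame like the one in the question is to count the distinct Python types inside each object column; `mixed_type_columns` below is a hypothetical helper:

def mixed_type_columns(df):
    """Return the labels of object columns whose non-null values mix Python types."""
    bad = []
    for col in df.columns:
        if df[col].dtype == object:
            # Count the distinct Python types among the non-null values.
            if df[col].dropna().map(type).nunique() > 1:
                bad.append(col)
    return bad

# e.g. columns = mixed_type_columns(myDF), then apply the applymap(str) fix above.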
  • is there any option in `read_csv` to specify that it should treat any column as a string if it appears as MIXED? – uday Apr 10 '14 at 21:11
  • I tried `df.loc[:,columns] = df[columns].applymap(str)` but it did not change the `dtype` in my case from `object` to string. Even in your example, the `dtype` doesn't change from `object` to string. – uday Apr 10 '14 at 21:30
  • dtype won't change; it will still be ``object``. The embedded values will be strings, though. THAT's the problem. In the data you are reading, the embedded objects are floats/ints (actual Python objects) and NOT strings. So when the frame is being written to the store, they are objects and NOT strings (and that is why you get the warning). – Jeff Apr 10 '14 at 21:37
  • For blanks, or NaN, this outputs nan as a string, which is actually shown in the file (not desired). Should I replace those with np.nan, or will that cause the column to be an object again? Or should I use fillna('Blank') or something? It seems like that will eat up space, but then again I have a lot of space. – trench May 18 '16 at 16:54
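
Regarding the last comment, one possible approach (not from the thread) is to convert only the non-null entries to str, so that missing values stay as NaN instead of becoming the literal string 'nan':

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 'foo', np.nan]})

# Convert only the non-null entries; NaN stays NaN rather than the string 'nan'.
mask = df['A'].notnull()
df.loc[mask, 'A'] = df.loc[mask, 'A'].map(str)

# The column dtype is still object, but its values are now str or NaN;
# fillna('') is an alternative if an empty string in the file is acceptable.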