
Currently, the data in the HDF5 file does not have the prefix 'b', but when I read it in Python 3 the string columns come back as bytes (displayed with a b prefix). I read the file with the following code. I wonder whether there is a better way to read the HDF5 file so that the strings come out without the b prefix.

import tables as tb
import pandas as pd
import numpy as np
import time

time0=time.time()
pth='d:/download/'

# read data
data_trading=pth+'Trading_v01.h5'
filem=tb.open_file(data_trading,mode='r',driver="H5FD_CORE")  # read-only; no need for mode='a' here
tb_trading=filem.get_node(where='/', name='wind_data')
df=pd.DataFrame.from_records(tb_trading[:])
time1=time.time()
print('\ntime on reading data %6.3fs' %(time1-time0))

# in python3, remove prefix 'b'
df.loc[:,'Date']=[dt.decode('utf-8') for dt in df.loc[:,'Date']]
df.loc[:,'Code']=[cd.decode('utf-8') for cd in df.loc[:,'Code']]

time2=time.time()
print("\ntime on removing prefix 'b' %6.3fs" %(time2-time1))
print('\ntotal time %6.3fs' %(time2-time0))

The timing results:

time on reading data 1.569s

time on removing prefix 'b' 29.921s

total time 31.490s

As you can see, removing the prefix 'b' is really time consuming.

I have tried pd.read_hdf, which does not produce the prefix 'b'.

%time df2=pd.read_hdf(data_trading)
Wall time: 14.7 s

which so far is faster.


Using this SO answer and a vectorised str.decode, I can cut the conversion time to 9.1 seconds (and thus the total time to less than 11 seconds):

 for key in ['Date', 'Code']: 
     df[key] = df[key].str.decode("utf-8")

Question: is there an even more effective way to convert my bytes columns to strings when reading an HDF5 data table?

Renke
  • @Evert thanks, your suggestion is good. I use `df['Date']=df['Date'].str.decode("utf-8") df['Code']=df['Code'].str.decode("utf-8")` and the result is **time on removing prefix 'b' 9.315s**. – Renke Jul 23 '17 at 01:41
  • I tried the expression for multiple columns, `str_df = df.loc[:,['Date','Code']] str_df = str_df.stack().str.decode('utf-8').unstack() for col in str_df: df[col] = str_df[col]`, and the result is worse; the time is about 19s. – Renke Jul 23 '17 at 01:43
  • It may be that `stack()` is not efficient here, perhaps depending on the layout of the frame. That would be food for another question, though (one possibly really about the Pandas internals). But I'd personally be happy with `for key in ['Date', 'Code']: df[key] = df[key].str.decode("utf-8")`. (It would have been nice if `df[['Date', 'Code']] = df[['Date', 'Code']].str.decode("utf-8")` were possible, though.) –  Jul 23 '17 at 01:50
  • @Evert yeah, you advise a better question title for my question. – Renke Jul 23 '17 at 01:58
  • `df[['Date', 'Code']] = df[['Date', 'Code']].str.decode("utf-8")` does not work; it reports the error "AttributeError: 'DataFrame' object has no attribute 'str'". `for key in ['Date', 'Code']: df[key] = df[key].str.decode("utf-8")` works. – Renke Jul 23 '17 at 02:01
  • Up to you: if you're happy with the solution, close it as a duplicate. Or you can change the question (title and contents), include the suggestion from the other question's answer in your question, and ask if there's a still more efficient way to do things. –  Jul 23 '17 at 02:02
  • I did mention "Would have been [...] if [...] were possible" for the last idea, because I know it isn't (yet); hence it's also between parentheses. Lots of conditionals there ;-). –  Jul 23 '17 at 02:03
  • I've rephrased things a bit in your question, to give appropriate attribution by linking to the relevant question/answer, and hopefully comparing the various methods slightly more clearly. You can always roll back the changes if you disagree, or make further edits as you see fit. –  Jul 23 '17 at 02:09
  • Can you try with `str.decode('ascii')` instead of utf-8 and tell us what time it takes then? – John Zwinck Jul 23 '17 at 02:31
  • @JohnZwinck the time taken is 1.50s + 9.37s = 10.87s. – Renke Jul 23 '17 at 02:35

1 Answer


The best solution for performance is to stop trying to "remove the b prefix." The b prefix is there because your data consists of bytes, and Python 3 insists on displaying this prefix to indicate bytes in many places, even in places where it makes no sense, such as the output of the built-in csv module.

But inside your own program this may not hurt anything, and in fact if you want the highest performance you may be better off leaving these columns as bytes. This is especially true if you're using Python 3.0 to 3.2, which always use a multi-byte unicode representation.

Even if you are using Python 3.3 or later, where the conversion from bytes to unicode doesn't cost you any extra space, it may still be a waste of time if you have a lot of data.

Finally, Pandas is not optimal if you are dealing with columns of mostly unique strings which have a somewhat consistent width. For example if you have columns of text data which are license plate numbers, all of them will fit in about 9 characters. The inefficiency arises because Pandas does not exactly have a string column type, but instead uses an object column type, which contains pointers to strings stored separately. This is bad for CPU caches, bad for memory bandwidth, and bad for memory consumption (again, if your strings are mostly unique and of similar lengths). If your strings have highly variable widths, it may be worth it because a short string takes only its own length plus a pointer, whereas the fixed-width storage typical in NumPy and HDF5 takes the full column width for every string (even empty ones).
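To make the storage difference concrete, here is a small sketch comparing a fixed-width NumPy bytes column with an object column of Python strings (the license-plate values are invented, and the exact sizes are CPython/NumPy implementation details):

```python
import sys
import numpy as np

plates = [b"ABC-1234", b"XYZ-9876", b"QRS-5555"]

# Fixed-width bytes column: one contiguous buffer, 8 bytes per element.
fixed = np.array(plates, dtype="S8")
print(fixed.itemsize)         # 8 -- each string stored inline

# Object column: an array of pointers, each to a separately allocated
# str object with its own header overhead.
obj = np.array([p.decode("ascii") for p in plates], dtype=object)
print(sys.getsizeof(obj[0]))  # tens of bytes per string, plus the pointer
```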

To get fast, fixed-width string columns in Python, you may consider using NumPy, which you can read via the excellent h5py library. This will give you a NumPy array which is a lot more similar to the underlying data stored in HDF5. It may still have the b prefix, because Python insists that non-unicode strings always display this prefix, but that's not necessarily something you should try to prevent.
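A minimal sketch of what that looks like with h5py (the file and dataset names are invented for illustration; an in-memory core driver is used so the example is self-contained):

```python
import h5py
import numpy as np

# Create a small fixed-width bytes dataset in memory (backing_store=False
# keeps it off disk), standing in for your existing HDF5 file.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("codes",
                     data=np.array([b"000001.SZ", b"600000.SH"], dtype="S9"))
    # Reading a dataset yields a NumPy array of fixed-width bytes (|S9),
    # mirroring the on-disk layout -- no per-element Python objects.
    codes = f["codes"][:]

print(codes.dtype)  # |S9 -- displayed with the b prefix, but compact and fast
```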

John Zwinck
  • thanks for your advice. Yeah, what I wanted was to find a fast way to deal with the data; as you said, using h5py and NumPy may be faster. But I need to clean the data with some pandas methods, so I will think about your suggestion and see whether I can find a better way. As for the prefix 'b', I found I have some problems dealing with 'Date' when there is a prefix 'b', so I think I need to remove it. – Renke Jul 23 '17 at 03:05
  • @Renke: What do you mean about 'date'? If you're trying to parse date strings from text you can do that using `pd.to_datetime()` among others. Again I strongly advise you to stop thinking about this in terms of "removing the `b` prefix" and instead think about it in terms of finding optimal solutions for working with your data. It's not as if you have strings that actually start with `b`--Python just prints them this way. – John Zwinck Jul 23 '17 at 03:20
  • pd.to_datetime cannot work without removing the prefix 'b'; it reports the error "TypeError: is not convertible to datetime". – Renke Jul 23 '17 at 04:07
  • @Renke: Right, if you have an array of `bytes` you should use `arr.astype('datetime64[D]')` (or use `[ns]` or other units depending on what sort of dates/times they are). In other words, use NumPy rather than Pandas for this part. If you have the data in Pandas you can still use `.astype('datetime64[D]')`, and it works even with `bytes` input. There is no need to "remove the prefix" when you can go straight from your `bytes` input to the datetime type you actually want. – John Zwinck Jul 23 '17 at 06:47
  • thanks. **.astype('datetime64[D]')** works; it takes only about 1.5s. – Renke Jul 23 '17 at 08:51
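The bytes-to-datetime conversion discussed in the comments above can be sketched like this (the sample dates are invented; the column is assumed to hold ISO-format ASCII dates such as b"2017-07-23"):

```python
import numpy as np

# Fixed-width bytes dates, as they would come out of the HDF5 file.
raw = np.array([b"2017-07-21", b"2017-07-22", b"2017-07-23"], dtype="S10")

# NumPy parses the ISO-format bytes directly into day-resolution
# datetimes -- no intermediate decode to str is needed.
dates = raw.astype("datetime64[D]")
print(dates[0])  # 2017-07-21
```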