The data in my h5 file is stored without the 'b' prefix, but when I read the file with the following code, the string columns come back as bytes (shown with the prefix 'b' in Python 3). Is there a better way to read the h5 file so that the prefix 'b' does not appear?
import tables as tb
import pandas as pd
import numpy as np
import time
time0=time.time()
pth='d:/download/'
# read data
data_trading=pth+'Trading_v01.h5'
filem=tb.open_file(data_trading,mode='a',driver="H5FD_CORE")
tb_trading=filem.get_node(where='/', name='wind_data')
df=pd.DataFrame.from_records(tb_trading[:])
time1=time.time()
print('\ntime on reading data %6.3fs' %(time1-time0))
# in python3, remove prefix 'b'
df.loc[:,'Date']=[[dt.decode('utf-8')] for dt in df.loc[:,'Date']]
df.loc[:,'Code']=[[cd.decode('utf-8')] for cd in df.loc[:,'Code']]
time2=time.time()
print("\ntime on removing prefix 'b' %6.3fs" %(time2-time1))
print('\ntotal time %6.3fs' %(time2-time0))
The timing results:
time on reading data 1.569s
time on removing prefix 'b' 29.921s
total time 31.490s
As you can see, removing the prefix 'b' is really time-consuming.
I have also tried pd.read_hdf, which does not produce the prefix 'b':
%time df2=pd.read_hdf(data_trading)
Wall time: 14.7 s
This is faster than the code above.
Using this SO answer and a vectorised str.decode, I can cut the conversion time to 9.1 seconds (and thus the total time to less than 11 seconds):
for key in ['Date', 'Code']:
    df[key] = df[key].str.decode("utf-8")
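For anyone who wants to reproduce the conversion without my file, here is a minimal self-contained sketch. The small structured array `records` is a hypothetical stand-in for `tb_trading[:]` (the field widths and the `Close` column are assumptions, not my real schema). It shows the vectorised `str.decode` from above next to a variant that decodes on the NumPy side with `np.char.decode` before the values reach pandas. I have not timed the second variant on the real data, so treat it as something to benchmark rather than a confirmed improvement.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tb_trading[:]: a structured array whose
# string fields are fixed-width bytes, as PyTables returns them.
records = np.array(
    [(b"2017-01-03", b"000001.SZ", 9.16),
     (b"2017-01-04", b"000002.SZ", 20.73)],
    dtype=[("Date", "S10"), ("Code", "S9"), ("Close", "f8")],
)

df = pd.DataFrame.from_records(records)

# Variant A: vectorised decode on the pandas side.
df_a = df.copy()
for key in ["Date", "Code"]:
    df_a[key] = df_a[key].str.decode("utf-8")

# Variant B: decode each bytes field on the NumPy side,
# then assign the resulting unicode array to the column.
df_b = df.copy()
for key in ["Date", "Code"]:
    df_b[key] = np.char.decode(records[key], "utf-8")

print(df_a.head())
print(df_b.head())
```

Both variants leave the numeric column untouched and only replace the bytes columns with decoded strings.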
Question: is there an even more efficient way to convert my bytes columns to strings when reading an HDF5 data table?