When reading a large HDF5 file with pandas.read_hdf(), I get extremely slow read times. The file has 50 million rows, with 3 integer columns and 2 string columns. Writing it with to_hdf(), using the table format with indexing, took almost 10 minutes; while that is also slow, I am not too concerned, as read speed is more important to me.
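For reference, the write call looked roughly like this (the df variable, file name, and key are placeholders, and data_columns=True is my assumption for the indexing mentioned above):

import pandas as pd

# roughly how the file was written; names and options are placeholders
df.to_hdf(
    "data.h5",
    key="data",          # matches the /data node read via PyTables below
    format="table",      # PyTables table format
    data_columns=True,   # create on-disk column indexes
)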
I have tried saving in both fixed and table format, with and without compression, but the read time still ranges between 2 and 5 minutes. By comparison, read_csv() on the same data takes 4 minutes.
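The read path being timed is just the plain call (again, file and key names are placeholders):

import time
import pandas as pd

start = time.time()
df = pd.read_hdf("data.h5", key="data")  # 2-5 minutes regardless of format/compression
print(f"read_hdf took {time.time() - start:.1f}s")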
I have also tried reading the HDF5 file with PyTables directly. This is much faster, at 6 seconds, and is the kind of speed I would like to see:

import tables

h5file = tables.open_file("data.h5", "r")
table = h5file.root.data.table.read()  # returns a NumPy structured array
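Note that table.read() gives back a NumPy structured array rather than a DataFrame. As a rough sketch, assuming the columns were written with data_columns=True (so each field maps to one column, and the stored index field has pandas' default name "index"), it can be wrapped back up like this:

import pandas as pd

# Sketch: one structured-array field per column; with pandas' default
# block layout ('values_block_0', ...) the blocks would need manual
# reassembly instead. String columns come back as bytes.
df = pd.DataFrame.from_records(table, index="index")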
I noticed that all the speed comparisons in the documentation use only numeric data, and when I ran those benchmarks myself I saw similar performance. Is there something I can do to optimise read performance?
Edit
Here is a sample of the data:
               col_A     col_B  col_C     col_D  col_E
30649671  1159660800  10217383      0  10596000  LACKEY
26198715  1249084800   0921720      0         0  KEY CLIFTON
19251910   752112000   0827092    104    243000  WEMPLE
47636877  1464739200  06247715      0         0  FLOYD
14121495  1233446400  05133815      0    988000  OGU ALLYN CH 9
41171050  1314835200  7C140009      0     39000  DEBERRY A
45865543  1459468800   0314892     76    254000  SABRINA
13387355   970358400  04140585     19   6956000  LA PERLA
 4186815   849398400  02039719      0  19208000  NPU UNIONSPIELHAGAN1
32666568   733622400  10072006      0   1074000  BROWN
And the info on the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52046850 entries, 0 to 52046849
Data columns (total 5 columns):
col_A int64
col_B object
col_C int64
col_D int64
col_E object
dtypes: int64(3), object(2)
memory usage: 1.9+ GB