I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables with certain columns indexed. I will be reading subsets of these data in a pandas DataFrame and performing computations.
I have attempted the following:
Download the the table, using a utility into a csv file, read the csv file chunk by chunk using pandas and append to HDF5 table using pandas.HDFStore
. I created a dtype definition and provided the maximum string sizes.
However, now when I am trying to download data directly from Oracle DB and post it to HDF5 file via pandas.HDFStore
, I run into some problems.
pandas.io.sql.read_frame does not support chunked reading. I don't have enough RAM to be able to download the entire data to memory first.
If I try to use cursor.fecthmany()
with a fixed number of records, the read operation takes ages at the DB table is not indexed and I have to read records falling under a date range. I am using DataFrame(cursor.fetchmany(), columns = ['a','b','c'], dtype=my_dtype)
however, the created DataFrame always infers the dtype rather than enforce the dtype I have provided (unlike read_csv which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFDatastore
, there is a type mismatch for e.g. a float64 will maybe interpreted as int64 in one chunk.
Appreciate if you guys could offer your thoughts and point me in the right direction.