
I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables, with certain columns indexed. I will be reading subsets of this data into a pandas DataFrame and performing computations.

I have attempted the following:

Download the table into a CSV file using a utility, read the CSV file chunk by chunk with pandas, and append each chunk to the HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
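Roughly, the working CSV pipeline looks like this (a simplified sketch; the column names, dtypes, chunk size and string width are placeholders, not my real schema):

    import pandas as pd

    # Simplified sketch of the CSV-based pipeline that works for me.
    # Column names, dtypes, chunk size and string widths are placeholders.
    my_dtype = {'a': 'float64', 'b': 'int64', 'c': 'object'}

    store = pd.HDFStore('data.h5', mode='w')
    for chunk in pd.read_csv('export.csv', dtype=my_dtype, chunksize=100000):
        # min_itemsize fixes the maximum width of string columns so that later
        # chunks containing longer strings still fit the table definition
        store.append('mytable', chunk, data_columns=['a'], min_itemsize={'c': 50})
    store.close()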

However, when I try to download the data directly from the Oracle DB and write it to the HDF5 file via pandas.HDFStore, I run into some problems.

pandas.io.sql.read_frame does not support chunked reading, and I don't have enough RAM to download the entire dataset into memory first.

If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages because the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns = ['a','b','c'], dtype=my_dtype); however, the created DataFrame always infers the dtypes rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch, e.g. a float64 may be interpreted as int64 in one chunk.
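A simplified sketch of this fetchmany() loop (the connection string, column names, dtypes and chunk size below are placeholders); coercing each column with astype() after constructing the chunk is one way to pin the dtypes down:

    import datetime
    import cx_Oracle   # assumption: the query runs through cx_Oracle
    import pandas as pd

    columns = ['a', 'b', 'c']                                  # placeholder columns
    my_dtype = {'a': 'float64', 'b': 'int64', 'c': 'object'}   # placeholder dtypes

    conn = cx_Oracle.connect('user/password@dsn')              # placeholder DSN
    cursor = conn.cursor()
    cursor.execute(
        "SELECT a, b, c FROM my_table WHERE trade_date BETWEEN :1 AND :2",
        (datetime.date(2013, 1, 1), datetime.date(2013, 12, 31)))

    store = pd.HDFStore('data.h5', mode='a')
    while True:
        rows = cursor.fetchmany(50000)
        if not rows:
            break
        chunk = pd.DataFrame(rows, columns=columns)
        # DataFrame() infers dtypes from the fetched rows, so coerce each column
        # explicitly; this keeps 'a' as float64 even in chunks that look integral
        for col, dt in my_dtype.items():
            chunk[col] = chunk[col].astype(dt)
        store.append('mytable', chunk, min_itemsize={'c': 50})
    store.close()
    conn.close()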

I'd appreciate it if you could offer your thoughts and point me in the right direction.

  • Your current approach (with a CSV) and the dtype correction is right. SQL will get a major update in 0.14 (0.13 is releasing shortly), so unfortunately dtype inference/chunking is not available yet. PRs are welcome though! See this issue: https://github.com/pydata/pandas/issues/4163 – Jeff Dec 16 '13 at 18:54
  • I suggest removing the Oracle tag unless you are having any issues on the Oracle side of things. – WW. Mar 04 '14 at 23:33
  • You should ask your DBA to convert the table to a range-partitioned object; after that it should be easy to access by partition. – Juan Diego Godoy Robles Mar 05 '14 at 08:08
  • Wouldn't it solve the problem if you did not export all the data to a single CSV file but chunked it up into several smaller files? If you have to do this job regularly, this will also reduce space requirements and runtime: the Oracle export writes one small CSV after the other while the HDF5 process imports them in parallel, deleting finished files. – Falco Apr 09 '14 at 14:20

2 Answers


Well, the only practical solution for now is to use PyTables directly, since it's designed for out-of-memory operation. It's a bit tedious, but not that bad:

http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
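For illustration, a minimal sketch of what direct PyTables usage might look like (the Record schema, file name and rows_from_database() helper are all made up; adapt them to your table):

    import tables

    # Hypothetical schema: one float, one int and one fixed-width string column
    class Record(tables.IsDescription):
        a = tables.Float64Col()
        b = tables.Int64Col()
        c = tables.StringCol(50)

    h5 = tables.open_file('data.h5', mode='w')
    table = h5.create_table('/', 'mytable', Record, expectedrows=3000000)

    row = table.row
    for a, b, c in rows_from_database():   # placeholder: iterate over DB rows
        row['a'] = a
        row['b'] = b
        row['c'] = c                       # should be an ASCII/bytes string for StringCol
        row.append()
    table.flush()

    table.cols.a.create_index()            # index the columns you will query on
    h5.close()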

Another approach, using Pandas, is here:

"Large data" work flows using pandas

– LetMeSOThat4U

Okay, so I don't have much experience with Oracle databases, but here are some thoughts:

Your access time for any particular record from Oracle is slow because of the lack of indexing and the fact that you want the data in timestamp order.

Firstly, can't you enable indexing on the table?

If you can't modify the database, you can presumably request a result set that only includes the ordered unique ids for each row?

You could potentially store this data as a single array of unique ids, which should fit in memory. Even allowing a generous few hundred bytes per key (a conservative estimate that includes Python object overhead), and without keeping the timestamps, an array of ids for 3 million records would use around 1 GB of RAM; a plain array of 64-bit integers would need only about 24 MB. That's not a whole heap, and presumably you only want a small window of active data, or perhaps you are processing row by row?

Make a generator function to do all of this. That way, once iteration completes, the memory should be freed without you having to del anything, and it also makes your code easier to follow and keeps the important logic of your calculation loop uncluttered.
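A rough sketch of such a generator, assuming a DB-API cursor (the cursor and the process() step are placeholders):

    def fetch_chunks(cursor, chunk_size=50000):
        """Yield blocks of rows from an open DB-API cursor until it is exhausted."""
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                return
            yield rows

    # Each block goes out of scope on the next iteration, so its memory can be
    # reclaimed without an explicit del.  `cursor` and `process` are placeholders.
    for rows in fetch_chunks(cursor):
        process(rows)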

If you can't store it all in memory, or this doesn't work for some other reason, then the best thing you can do is work out how much you can hold in memory. You can potentially split the job into multiple requests and use multithreading to send the next request as soon as the last one has finished, while you process the previous data into your new file. It shouldn't use up memory until you ask for the data to be returned. Try to work out whether the delay is in the request being fulfilled or in the data being downloaded.
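For example, a simple producer/consumer sketch where a background thread prefetches the next block while the current one is being processed (again, cursor and process() are placeholders, and this is untested against Oracle):

    import queue
    import threading

    def producer(cursor, q, chunk_size=50000):
        """Fetch blocks from the cursor and hand them to the main thread."""
        while True:
            rows = cursor.fetchmany(chunk_size)
            q.put(rows)
            if not rows:
                break            # an empty block signals the end of the data

    q = queue.Queue(maxsize=2)   # bounded: at most two blocks sit in RAM at once
    t = threading.Thread(target=producer, args=(cursor, q))
    t.start()

    while True:
        rows = q.get()
        if not rows:
            break
        process(rows)            # placeholder: append to HDF5 / run calculations
    t.join()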

From the sounds of it, you might be abstracting the database and letting pandas make the requests. It might be worth looking at how it limits the results; you should be able to issue the request for all the data but pull the results from the database server one row at a time.

– TinBane