
Problem

It takes a long time to query our Oracle database with Python using cx_Oracle.

Method

I connect to the Oracle database using cx_Oracle and query it with a standard SQL statement like:

select * from t1;

Results are returned as a pandas dataframe.
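
For context, a minimal sketch of the kind of code in use (the connection details and table name are placeholders, and pandas.read_sql over a cx_Oracle connection is an assumption about the setup):

import cx_Oracle
import pandas as pd

# Hypothetical connection details -- replace with your own.
conn = cx_Oracle.connect("user", "password", "host:1521/service_name")

# Pull the whole table into a DataFrame; with no row limit this
# ships every row of t1 over the network to the client.
df = pd.read_sql("select * from t1", con=conn)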

Observations

  • Querying from Spyder and Jupyter Notebooks is equally slow.
  • Querying from an SQL client like DBeaver returns results in ~1/10th the time.

Caveat

I have not tested whether this holds true when I include row limits in the cx_Oracle queries.

MinneapolisCoder9
  • Your question is a bit generic. Without additional information (like what data you are querying and what code you are using to query it) I can't answer your question. I find the performance of cx_Oracle adequate in most cases, somewhat slower than the equivalent C program but with a lot less effort. Since you are using a dataframe that implies you are using something like Pandas. Try using cx_Oracle directly as well. – Anthony Tuininga Mar 30 '20 at 15:51
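
Following that comment, a minimal sketch of what "using cx_Oracle directly" (bypassing pandas) might look like, reusing the hypothetical conn from above:

cur = conn.cursor()
cur.execute("select * from t1")
rows = cur.fetchall()  # plain list of tuples, no DataFrame construction overhead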

3 Answers

Christopher Jones

@Christopher Jones and @Anthony Tuininga pointed me toward the next-level answer for why queries were returning so slowly. In my Caveat section above, I noted that I had no row limit such as

WHERE rownum < 10

in my queries. Including this did indeed reduce query times. My bad.
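
For example, a quick sketch of the row-limited fetch (reusing the hypothetical conn from the question):

# Limiting the result set makes the query return almost immediately,
# which confirms the bottleneck was the volume of data transferred.
df_sample = pd.read_sql("select * from t1 where rownum < 10", con=conn)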

Also, a command like

create table t1 as
select *
from table1

results in instant feedback, since nothing is being returned from the db to the client. It was just a matter of issuing a query that demanded a lot of data back. An SQL client has those row limits built in to prevent people like me from asking these questions! :)
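
To illustrate, a rough sketch (again with the hypothetical conn): the statement runs entirely on the server, so no result set ever travels to the client:

cur = conn.cursor()
# The table is built server-side; no rows are fetched by the client,
# so the call returns without any bulk data transfer.
cur.execute("create table t1 as select * from table1")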

MinneapolisCoder9

Addressing Slow Write Speeds

When writing to Oracle, all DataFrame columns that are string objects are by default converted to CLOBs (character large objects), which can hold up to 2,147,483,647 characters. This means that tables are very likely far larger than necessary.

When using df.to_sql, you typically use something like the following:

df.to_sql("tablename"
        connection_string,
        schema='myschema',
        if_exists='replace',
        index=False
        )

However, omitting the optional dtype parameter may result in CLOBs.

A Solution

To overcome this, you can manually provide the dtype mapping (where sa is SQLAlchemy, i.e. import sqlalchemy as sa):

df.to_sql("tablename"
        connection_string,
        schema='myschema',
        if_exists='replace',
        index=False,
        dtype={"RUN_DT": sa.types.DateTime(),
        "fname": sa.types.VARCHAR(length=50),
        "lname": sa.types.VARCHAR(length=250),
        "FROM_DT": sa.types.DateTime(),
        "TO_DT": sa.types.DateTime(),
        "UUID": sa.types.VARCHAR(length=50)}
        )

A Better, Generalized Solution

But what if you have many tables and many columns and you need a generalized solution?

@Parfait provided the perfect solution to this issue on SO here.
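
As a rough sketch of one such generalized approach (not necessarily @Parfait's exact code; build_dtype_map is a hypothetical helper, and sizing each VARCHAR to the longest observed value is an assumption), the dtype mapping can be built from the DataFrame itself:

import sqlalchemy as sa

def build_dtype_map(df, default_length=255):
    # Map every object (string) column to a VARCHAR sized to its longest
    # value so that nothing silently falls back to CLOB.
    dtype_map = {}
    for col in df.columns:
        if df[col].dtype == object:
            max_len = int(df[col].astype(str).str.len().max() or default_length)
            dtype_map[col] = sa.types.VARCHAR(length=max_len)
    return dtype_map

df.to_sql("tablename",
        connection_string,
        schema='myschema',
        if_exists='replace',
        index=False,
        dtype=build_dtype_map(df))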

MinneapolisCoder9