
I am using jaydebeapi (Mac OS X) to query a Netezza database and performing some quick/dirty timing:

import time

t0 = time.time()
curs.execute('''select * from table1;''')
print time.time() - t0

I created the table myself and it contains 650,000 rows and 9 columns (integers and dates).

When I run the above command it takes about 1.3 minutes to complete (averaged over 10 runs).

Then, when I try to fetch the data:

t0 = time.time()
curs.execute('''select * from table1;''')
row = curs.fetchone()
while row is not None:
    row = curs.fetchone()
print time.time() - t0

It takes about 10 minutes to complete (averaged over 10 runs).

Now, when I run the same SQL query using WinSQL (Windows 7, ODBC), it takes about 3 minutes to return the data. I can't seem to figure out why it is taking so much longer in Python and am not sure how or where to start looking.

slaw
  • What driver are you using? Does it have any settings? BTW do you really need to fetch and process 650k rows entirely with Python, without any SQL? – 9000 Nov 13 '14 at 19:19
  • For Mac OS X it's the JDBC driver (ODBC doesn't exist for Macs as far as I know) and the ODBC driver for Windows 7. Not sure if this answers your question but I'm going to be slicing and dicing using Pandas. – slaw Nov 13 '14 at 19:31
  • If you're going to use Pandas you might want to just read the data directly into a Pandas DF. See this example: http://stackoverflow.com/a/12060886/1216837 – ACV Nov 13 '14 at 19:48
  • I've seen "pd.read_sql()" and it takes 10 minutes just the same. I can't tell whether that function uses "fetchone()", "fetchmany()", or "fetchall()". – slaw Nov 13 '14 at 19:56

3 Answers


Are you using JayDeBeApi in combination with JPype or together with Jython? Fetching large result sets with the JPype implementation causes JNI calls for every single cell value, which adds a lot of overhead. You should consider one of the following options:

  1. Minimize the size of your result set. Do aggregations using SQL functions (see the sketch after this list).
  2. Give the newest implementation of JPype1 a try. There have been some performance improvements.
  3. Switch your runtime to Jython (JayDeBeApi works on Jython as well).
  4. Implement the DB queries and data extraction directly in Java and call the logic using JPype, but with an interface that doesn't return a large data set.
  5. Try to improve the JPype and JayDeBeApi code.
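
For option 1, here is a minimal sketch of what pushing the aggregation into the database could look like. Everything specific in it is a placeholder: the host, credentials, jar path, and the some_date/some_value columns are invented for illustration, and the connect() call follows the current JayDeBeApi README.

import jaydebeapi

# All connection details below are placeholders -- substitute your own
# host, database, credentials, and path to the Netezza JDBC driver jar.
conn = jaydebeapi.connect(
    'org.netezza.Driver',
    'jdbc:netezza://dbserver:5480/mydb',
    ['myuser', 'mypassword'],
    '/path/to/nzjdbc3.jar')
curs = conn.cursor()

# Let Netezza do the aggregation so only a few summary rows cross the
# JDBC/JNI boundary instead of all 650,000 raw rows.
# "some_date" and "some_value" are hypothetical column names.
curs.execute('''
    select some_date, count(*), sum(some_value)
    from table1
    group by some_date
''')
summary = curs.fetchall()

curs.close()
conn.close()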
bastian
  • Yes, I am using JayDeBeApi in combination with JPype as that was the example that I could find. Can you point me to some examples of using/setting up JPype1 with JayDeBeApi? What do you mean by "switch your runtime to Jython"? – slaw Nov 30 '14 at 18:50
  • Is JPype1 available for Python 2.7? – slaw Nov 30 '14 at 18:59
  • Instead of installing JPype with `pip install JPype` you install it via `pip install JPype1`. JPype1 is the official development of JPype and is available for Python 2.7. – bastian Dec 01 '14 at 06:46
  • Jython is a Python interpreter implemented fully in Java instead of C. You can run your code without modifications if it doesn't use libraries that require native libraries. To set it up with JayDeBeApi you simply install Jython, skip the JPype install and directly install JayDeBeApi. See https://pypi.python.org/pypi/JayDeBeApi/#install for some hints. Feel free to ask again (think about writing a new question thread!) – bastian Dec 01 '14 at 06:51
  • Installed JPype1. Now I'm getting an error when I do "jpype.startJVM(jpype.getDefaultJVMPath(),'-Djava.class.path=/Path/to/some/dir/nzjdbc3.jar')" that says "RuntimeError: Unable to load DLL [/Library/Java/JavaVirtualMachines/1.6.0_65-b14-462.jdk/Contents/Libraries/libjvm.dylib], error = dlopen(/Library/Java/JavaVirtualMachines/1.6.0_65-b14-462.jdk/Contents/Libraries/libjvm.dylib, 9): no suitable image found. Did find: /Library/Java/JavaVirtualMachines/1.6.0_65-b14-462.jdk/Contents/Libraries/libjvm.dylib: mach-o, but wrong architecture at native/common/include/jp_platform_linux.h:45" – slaw Dec 04 '14 at 15:40
  • Don't know what's going on. Have you tried the recommendation in this [github issue](https://github.com/originell/jpype/issues/91)? Otherwise file a new JPype issue there. BTW: You usually don't have to call jpype.startJVM if you use jaydebeapi the right way. Check the usage section of [JayDeBeApi documentation](https://github.com/baztian/jaydebeapi/blob/master/README.rst). – bastian Dec 07 '14 at 09:23

You might want to use curs.fetchmany() instead of fetchone(). That will somewhat optimize the back and forth needed to fetch the rows.

Something like this will even hide the fact that you are fetching many rows at a time:

def fetchYield(cursor):
    li = []
    while True:
        if not li:
            li = cursor.fetchmany()
            if not li:
                return  # no more rows; end the generator
        yield li.pop(0)

for row in fetchYield(curs):
    # do something with row
    pass

However, I think that if a raw SQL query tool takes 3 minutes to fetch the data, it is not entirely unreasonable for your Python code to take 3x as long.
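
One caveat from PEP 249: fetchmany() called with no argument returns cursor.arraysize rows, which defaults to 1, so without an explicit batch size the generator above may be no faster than fetchone() (see the comment thread below). A variant with an explicit fetchsize parameter (the parameter name is mine, for illustration):

def fetchYield(cursor, fetchsize=1000):
    # Pull rows in batches of fetchsize to cut down on per-row driver
    # round trips, but still hand them to the caller one at a time.
    while True:
        rows = cursor.fetchmany(fetchsize)
        if not rows:
            break
        for row in rows:
            yield row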

JL Peyret
  • But one should probably also specify how many rows then, right? Otherwise, it defaults to one and would be identical to fetchone()? At least this is the behavior that I see for jaydebeapi. – slaw Nov 13 '14 at 20:37
  • I don't think so. There is a default "fetch size" for fetchmany in most Python DB API v2 implementations, I believe. See http://legacy.python.org/dev/peps/pep-0249/#cursor-methods. You can always do a print len(li) on the first iteration to see how many rows it returned. In any case, you can specify a default of 100 yourself: def fetchYield(cursor, fetchsize=100) and cursor.fetchmany(fetchsize), for example. – JL Peyret Nov 13 '14 at 20:39

I had a similar problem, and I observed an improvement by using fetchall and setting the cursor's arraysize parameter (which defaults to 1), as described in the DB-API documentation on which JayDeBeApi is based.

import pandas as pd

cursor = conn.cursor()  # conn: an open jaydebeapi connection
cursor.arraysize = 10000
cursor.execute("select * from table1")

rows = cursor.fetchall()

# storing the data in a pandas DataFrame
df = pd.DataFrame(data=rows, columns=["C1", "C2", "C3"])

cursor.close()

I observed the following performance when fetching 600,000 rows:

arraysize = 10000 --- 509 seconds
arraysize = 1     --- 526 seconds

Nevertheless, I also observed a much greater fetch time compared, for instance, to a Java-based client using the same JDBC driver. My suggestion, as 9000 was saying, is to spend some time on your SQL query and let the database do the work; it's a faster and much more scalable solution.
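
For instance, since the question mentions slicing and dicing with pandas anyway, one option is to let the database aggregate and read only the summary into a DataFrame. A minimal sketch, assuming conn is an open jaydebeapi connection and some_date is a hypothetical grouping column:

import pandas as pd

# Aggregate server-side so only the summary rows travel over JDBC,
# not all 600,000+ raw rows.
query = '''
    select some_date, count(*) as n_rows
    from table1
    group by some_date
'''
df = pd.read_sql(query, conn)  # conn: an open DB-API connection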

Matt07