
I have a query result set of ~9 million rows.

I need to do some processing for each row, and the code currently does this:

query = conn.query(sql)
results = query.getresult()

for row in results: 
    # blah

I suspect that getresult() is pulling down the entire result set. Is that the case? I imagine there's a way to pull only chunks of the result set across the wire as needed, but I didn't immediately see anything like that in the pg module docs.

Is it possible to do this with the pgdb module instead, or with some other approach?

My concern is memory on the application machine: I'd rather not load millions of rows into memory all at once if I can help it.

Is this even worth worrying about?

anonymous coward
  • According to the PyGreSQL docs, getresult returns a pyqueryobject with various lists holding all of the data for the associated query. You'll need to see if one of Python's alternative DB access libraries includes an iterator for query results. – Spencer Rathbun Sep 19 '11 at 20:45

4 Answers


If it's following the Python Database API spec, you could use a cursor:

curs = conn.cursor()
curs.execute('select * from bigtable')

then use `curs.fetchone()` or `curs.fetchmany(chunksize)`.
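
For example, a minimal chunked-fetch sketch in the DB-API style (the table name and chunk size here are placeholders, not from the question):

curs = conn.cursor()
curs.execute('select * from bigtable')

while True:
    rows = curs.fetchmany(10000)  # pull at most 10,000 rows per call
    if not rows:
        break                     # fetchmany() returns an empty sequence when exhausted
    for row in rows:
        pass                      # process each row here

This keeps at most one chunk of rows in application memory at a time.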

John Machin
  • I used `for row in iter(curs.fetchone()):` - perhaps that was extra explicit? It works as expected. – anonymous coward Sep 20 '11 at 18:00
  • @anonymous: ??? If it's following the DB API, curs.fetchone() will fetch the first row. Iterating over that gives you the column values of the first row. Perhaps you meant `for row in iter(curs.fetchone, None):` – John Machin Sep 20 '11 at 21:52

pgdb's cursors are iterators:

cursor = conn.cursor()
cursor.execute(sql)

for row in cursor:
    # do something with row

where `conn` is created from `pgdb.connect(...)`

Dan D.
  • Help me understand: being an 'iterator', does that imply that the entire result set is *not* loaded into memory, but only pulled from the postgres server as required (by each iteration)? – anonymous coward Sep 19 '11 at 22:08
  • @anonymouscoward: yes; `cursor.next()` just calls `fetchone()` and raises `StopIteration` if there's no result. It's probably more pythonic to just treat the cursor as an iterator than to manually call `fetchone()`. – Wooble Sep 26 '11 at 00:49

I'm not sure how getresult() behaves, but another option would be PL/Python:

The PL/Python procedural language allows PostgreSQL functions to be written in the Python language.

That would let you work right inside the database. This might not be suitable for what you need to do, but it is worth a look.
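
As a rough sketch of that approach (the function and table names are hypothetical, and it assumes PostgreSQL 9.0+ where plpy.cursor is available and PL/Python is installed in the database):

CREATE FUNCTION process_bigtable() RETURNS integer AS $$
    # plpy.cursor fetches rows incrementally rather than
    # materializing the whole result set in memory at once
    processed = 0
    for row in plpy.cursor("SELECT * FROM bigtable"):
        # do the per-row work here, inside the database
        processed += 1
    return processed
$$ LANGUAGE plpythonu;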

mu is too short

Use `cursor.fetchmany()`, and make sure you explicitly set `arraysize` so that each batch of rows gives you the balance you need between performance and memory utilization.

I have jobs written with cx_Oracle (which also follows the DB-API spec) that move tables with several billion rows across the network in batches of 20,000 records. It takes a while, but I'm not blowing out my server memory on either the source or the target side.
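
A rough sketch of that batching pattern in DB-API style (the connection setup, table names, and three-column insert are made up for illustration; fetchmany() with no argument falls back to cursor.arraysize):

src = src_conn.cursor()
dst = dst_conn.cursor()

src.arraysize = 20000                      # rows fetched per round trip
src.execute('select * from big_source_table')

while True:
    batch = src.fetchmany()                # uses arraysize when no size is given
    if not batch:
        break
    dst.executemany('insert into big_target_table values (:1, :2, :3)', batch)
    dst_conn.commit()                      # commit per batch so nothing piles up

Only one batch of rows is held in application memory at a time.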

Bart K