28

I have a table with 4 million rows and I use psycopg2 to execute a:

   SELECT * FROM ... WHERE query

I hadn't heard of server-side cursors before, and I am reading that they are good practice when you expect a large number of results.

I find the documentation a bit limited and I have some basic questions.

First I declare the server-side cursor as:

cur = conn.cursor('cursor-name')

then I execute the query as:

cur.itersize = 10000
sqlstr = "SELECT clmn1, clmn2 FROM public.table WHERE clmn1 LIKE 'At%'"
cur.execute(sqlstr)

My question is: What do I do now? How do I get the results?

Do I iterate through the rows as:

row = cur.fetchone()
while row:
   row = cur.fetchone()

or do I use fetchmany() and do this:

row = cur.fetchmany(10)

But in the second case how can I "scroll" the results?

Also what is the point of itersize?

user1919

3 Answers

57

Psycopg2 has a nice interface for working with server side cursors. This is a possible template to use:

import psycopg2

with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:

        cursor.itersize = 20000

        query = "SELECT * FROM ..."
        cursor.execute(query)

        for row in cursor:
            pass  # process row

The code above creates the connection and automatically places the query result into a server side cursor. The value itersize sets the number of rows that the client will pull down at a time from the server side cursor. The value you use should balance number of network calls versus memory usage on the client. For example, if your result count is three million, an itersize value of 2000 (the default value) will result in 1500 network calls. If the memory consumed by 2000 rows is light, increase that number.
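To make that arithmetic concrete, here is a quick back-of-the-envelope sketch (the numbers are just the ones from the example above):

import math

total_rows = 3_000_000  # expected size of the result set
itersize = 2_000        # psycopg2's default for named cursors
print(math.ceil(total_rows / itersize))  # 1500 network round trips
print(math.ceil(total_rows / 20_000))    # 150 round trips with itersize = 20000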

When using for row in cursor you are of course working with one row at a time, but Psycopg2 will prefetch itersize rows at a time for you.

If you want to use fetchmany for some reason, you could do something like this:

while True:
    rows = cursor.fetchmany(100)
    if not rows:
        break
    for row in rows:
        pass  # process row

This usage of fetchmany will not trigger a network call to the server for more rows until the prefetched batch has been exhausted. (This is a convoluted example that provides nothing over the code above, but demonstrates how to use fetchmany should there be a need.)

Demitri
  • Naming the cursor is crucial. Per [the docs](http://initd.org/psycopg/docs/connection.html): "If name is specified, the returned cursor will be a server side cursor (also known as named cursor). Otherwise it will be a regular client side cursor". – unutbu Mar 06 '19 at 22:38
  • One situation in which the fetchmany example you provided has an advantage over iterating over each row individually is if you need to also query a secondary database to compare values and want to reduce the number of calls. You can dynamically build a query to the second database based on the values returned by fetchmany, minimizing both memory usage and server requests. – Maile Cupo Feb 07 '20 at 15:41
  • Anyway we can also get column names when using server side cursors? – user2549538 Jul 13 '20 at 15:31
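Following up on the batching comment above, a minimal sketch of that pattern might look like this; the second connection `other_conn`, the table `other_table`, and its columns are all hypothetical:

batch = cursor.fetchmany(1000)
while batch:
    ids = [row[0] for row in batch]  # key values from the current batch
    with other_conn.cursor() as other_cur:  # hypothetical second database
        # psycopg2 adapts a Python list to a PostgreSQL array, so ANY(%s) works
        other_cur.execute(
            "SELECT id, status FROM other_table WHERE id = ANY(%s)",
            (ids,),
        )
        status_by_id = dict(other_cur.fetchall())
    # compare the batch rows against status_by_id here
    batch = cursor.fetchmany(1000)

As for the column-name question: `cursor.description` exposes the column metadata (e.g. `[d[0] for d in cursor.description]`), though with a named cursor it is, as far as I know, only populated after the first fetch, since execute() merely declares the cursor on the server.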
4

I tend to do something like this when I don't want to load millions of rows at once. You can turn a program into quite a memory hog if you load millions of rows into memory. Especially if you're making python domain objects out of those rows or something like that. I'm not sure if the uuid4 in the name is necessary, but my thought is that I want individual server side cursors that don't overlap if two processes make the same query.

from typing import Iterable
from uuid import uuid4
import psycopg2

# MyDomainObject is the author's placeholder for your own row-wrapping class
def fetch_things() -> Iterable[MyDomainObject]:
    with psycopg2.connect(database_connection_string) as conn:
        with conn.cursor(name=f"my_name_{uuid4()}") as cursor:
            cursor.itersize = 500_000

            query = "SELECT * FROM ..."
            cursor.execute(query)

            for row in cursor:
                yield MyDomainObject(row)
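
A usage sketch (`process` is a stand-in for whatever consumes the objects); note that because this is a generator, the connection stays open until it is fully consumed:

for thing in fetch_things():
    process(thing)  # rows are pulled from the server in itersize batches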

I'm interested if anyone knows if this creates a storage problem on the SQL server or anything like that.

nackjicholson
2

In addition to cur.fetchmany(n), you can use PostgreSQL cursors directly:

cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# ...
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# and so on
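
The same idea in a loop, as a sketch; note that a cursor declared without WITH HOLD only lives inside a transaction, which psycopg2 opens implicitly here:

cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
while True:
    cur.execute("fetch forward 100 from foo")
    rows = cur.fetchall()
    if not rows:
        break
    # process rows here
cur.execute("close foo")  # release the cursor when finished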

Abelisto
  • @user1919 Yes, I've seen that. But a server-side cursor is faster to open. In any case you have the alternative. Just try both ways: `fetchmany()` in the loop, or the approach mentioned above. – Abelisto Jan 03 '17 at 15:08
  • Hmm, I see. But based on this answer I thought that I was already using a server-side cursor: http://stackoverflow.com/questions/28343240/psycopg2-uses-up-memory-on-large-select-query – user1919 Jan 03 '17 at 15:17
  • Also, here they describe the same way to make the server-side cursor: http://grokbase.com/t/postgresql/psycopg/11aeyymbc6/using-server-side-cursor – user1919 Jan 03 '17 at 15:19
  • @user1919 It seems you are right. When you provide the cursor name, it looks like `psycopg2` does the same thing internally. OK, just mention if my answer is useless and I will delete it. – Abelisto Jan 03 '17 at 15:28
  • Thanks. Good to verify. Your answer is not useless. I just still don't understand how I will "scroll" through the results if I use fetchmany. I guess I actually have to use a while loop as well. – user1919 Jan 03 '17 at 15:33
  • @user1919 Yes, what I wrote above is what `psycopg2` is doing internally, so you can use the [PostgreSQL doc](https://www.postgresql.org/docs/current/static/sql-declare.html) to understand what is happening. Good luck. – Abelisto Jan 03 '17 at 15:43