
I am using psycopg2 to query a PostgreSQL database and trying to process all rows from a table with about 380M rows. There are only three columns (id1, id2, count), all of type integer. However, when I run the straightforward select query below, the Python process starts consuming more and more memory until it is killed by the OS.

Minimal working example (assuming that mydatabase exists and contains a table called mytable):

import psycopg2
conn = psycopg2.connect("dbname=mydatabase")
cur = conn.cursor()
cur.execute("SELECT * FROM mytable;")

At this point the program starts consuming memory.

I had a look, and the PostgreSQL process is behaving well. It uses a fair bit of CPU, which is fine, and a very limited amount of memory.

I was expecting psycopg2 to return an iterator without buffering all of the results of the select, so that I could call cur.fetchone() repeatedly to process the rows one at a time.
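
For reference, this is a sketch of the loop I had in mind (it does not help with a default client-side cursor, because execute() has already buffered the whole result set by the time fetchone() is called):

# Sketch: the intended row-by-row loop. With a default (client-side)
# cursor, memory is exhausted during execute() itself, before
# fetchone() is ever reached.
row = cur.fetchone()
while row is not None:
    id1, id2, count = row  # the three integer columns
    # ... process one row at a time ...
    row = cur.fetchone()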

So, how do I select from a 380M row table without using up available memory?

Carl

2 Answers


You can use server-side cursors. Passing a name to conn.cursor() makes psycopg2 create and manage a cursor on the server for you:

cur = conn.cursor('cursor-name')  # passing a name creates a server-side cursor
cur.itersize = 10000  # how many records to buffer on the client at a time
cur.execute("SELECT * FROM mytable;")
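
A minimal consumption loop might look like this (a sketch; note that a named cursor only exists inside a transaction, so close it before committing):

# Sketch: iterating the named cursor makes psycopg2 fetch itersize
# rows from the server per round trip behind the scenes.
for row in cur:
    id1, id2, count = row
    # ... process one row ...
cur.close()    # closes the server-side cursor
conn.commit()  # the named cursor lives inside this transaction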
bav
  • Great, thanks for that! I'm going to make a few edits to improve the explanation a bit and then accept. – Carl Feb 05 '15 at 12:07
  • Does this cursor exist on the PostgreSQL server or in the Python client? DECLARE cursor_name CURSOR FOR SELECT * FROM mytable; – Frank Heikens Feb 05 '15 at 12:08
  • @FrankHeikens Everything is done from the Python client side in this case. No cursor exists server side. – Carl Feb 05 '15 at 12:20
  • 2
    No, that is wrong. When you create a "named cursor" psycopg will create a corresponding cursor on the server side and manage it for you. – fog Feb 05 '15 at 12:33
  • The cursor keeps growing over time as you load more data, until I run out of memory. itersize only seems to control how many rows you fetch from the server at a time; the cursor in memory still keeps growing. I can't figure out how to bound the cursor's size when reading 5 million records. I might have to read 10,000 rows, close the cursor, and open a new one. :( – radtek Mar 05 '21 at 14:18
  • 1
    @radtek have you found the way to “clear” cursor between fetches? – Ivan Feb 12 '22 at 19:43
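
Following up on the last two comments: one way to keep client memory bounded is to fetch in explicit batches with fetchmany(), so each batch can be garbage-collected before the next one arrives. A sketch, assuming the question's database; the cursor name bounded_cursor is illustrative:

import psycopg2

conn = psycopg2.connect("dbname=mydatabase")
batch = 10000
with conn.cursor(name='bounded_cursor') as cur:  # server-side cursor
    cur.execute("SELECT * FROM mytable;")
    while True:
        rows = cur.fetchmany(batch)  # at most `batch` rows held in memory
        if not rows:
            break
        for row in rows:
            pass  # process row; avoid keeping references to old rows
conn.commit()
conn.close()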

Another way to use server-side cursors:

with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:

        cursor.itersize = 20000

        query = "SELECT * FROM ..."
        cursor.execute(query)

        for row in cursor:
            pass  # process row here

Psycopg2 will fetch itersize rows to the client at a time. Once the for loop exhausts that batch, it will fetch the next one.
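
One caveat: psycopg2's connection context manager ends the transaction (commit on success, rollback on exception) but does not close the connection, so close it explicitly when you are done. A sketch, reusing the table from the question:

# Sketch: the outer `with conn` block only manages the transaction;
# the connection itself must still be closed afterwards.
import psycopg2

conn = psycopg2.connect("dbname=mydatabase")
try:
    with conn:
        with conn.cursor(name='name_of_cursor') as cursor:
            cursor.itersize = 20000
            cursor.execute("SELECT * FROM mytable;")
            for row in cursor:
                pass  # process row
finally:
    conn.close()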

Demitri