
I'm running a large query in a Python script against my PostgreSQL database using psycopg2 (I upgraded to version 2.5). After the query finishes, I close the cursor and the connection, and even run gc, but the process still consumes a ton of memory (7.3 GB, to be exact). Am I missing a cleanup step?

import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
import gc
gc.collect()

3 Answers


I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.

Instead of

cursor = conn.cursor()

write

cursor = conn.cursor(name="my_cursor_name")

or simpler yet

cursor = conn.cursor("my_cursor_name")

The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors

I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include "DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all of that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.

The suggestion in Craig Ringer's answer of using fetchone or fetchmany doesn't resolve the problem on its own, since, if you leave the name parameter unset, psycopg will by default attempt to load the entire result set into RAM. The only other thing you may need to do (besides declaring a name parameter) is change the cursor.itersize attribute from the default of 2000 to, say, 1000 if you still have too little memory.
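To make that concrete, here is a minimal sketch built from the question's own snippet (the connection string and query are the placeholders from the question):

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")

# Passing a name turns this into a server-side cursor:
# psycopg issues the DECLARE/FETCH statements for you.
cursor = conn.cursor("my_cursor_name")
cursor.itersize = 1000          # rows per round trip (default is 2000)

cursor.execute("""large query""")
for row in cursor:              # rows arrive in itersize-sized batches
    pass                        # process one row at a time

cursor.close()
conn.close()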

joeblog
  • I couldn't find anything in `sqlalchemy` that helped me avoid the OOM issue, but this solution worked for me. Thank you! – ryantuck Nov 21 '17 at 23:13
  • @RyanTuck It appears that you can accomplish this in `sqlalchemy` by passing `server_side_cursors=True` to `create_engine`: http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=server_side_cursors#psycopg2-connect-arguments – FoxMulder900 Jul 30 '18 at 16:26
  • @joeblog can you please shed more light on your answer above? Are you declaring the variable and giving it a value (i.e. `my_cursor_name = 1000`) before creating the cursor? I want to adopt your solution, setting the `FETCH_COUNT` from my Python script before calling `cursor.execute('query')` – arilwan Apr 25 '19 at 20:33
  • The (sole) argument in conn.cursor is called name, and I've just assigned "my_cursor_name" to it as a Python string. I imagine it can be anything, provided it can be translated into a PostgreSQL identifier. You wouldn't assign 1000 to the cursor name, but to cursor.itersize on the rare occasions when the default 2000 is too high. – joeblog Apr 27 '19 at 06:48
  • @FoxMulder900 For the record, it looks like `server_side_cursors` has been replaced by `stream_results` – Joril Nov 10 '20 at 16:23
  • Server-side cursor worked for me, thanks!!! Instead of using fetchall and getting a segmentation fault, it worked marvellously!!! – Nikhil S Mar 22 '21 at 08:59
  • Still, why doesn't `psycopg` release memory? This is a workaround – Kiruahxh Aug 22 '23 at 11:49
  • The problem isn't that psycopg doesn't release memory, but that the client machine's RAM is getting overwhelmed by asking for too much data. Having gained a bit more experience with Postgres over the years, the better solution is learning PL/pgSQL and keeping the work server side rather than trying to do too much client side with Python. – joeblog Aug 23 '23 at 09:10
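For anyone following the SQLAlchemy route discussed in the comments above, here is a rough, untested sketch using the `stream_results` execution option (the one that superseded the old `server_side_cursors` engine flag); the connection URL and table name are placeholders:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user@host/dbname")

with engine.connect() as conn:
    # stream_results=True asks the psycopg2 dialect to use a server-side cursor
    result = conn.execution_options(stream_results=True).execute(
        text("SELECT * FROM my_big_table"))
    for row in result:
        pass  # process one row at a time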

Please see the answer by @joeblog for the better solution.


First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
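A rough sketch of that chunked loop (the chunk size is arbitrary; note the caveat in the comments below that a plain client-side cursor still transfers the whole result set during execute()):

cursor = conn.cursor()
cursor.execute("""large query""")

while True:
    rows = cursor.fetchmany(2000)   # pull at most 2000 rows per call
    if not rows:
        break
    for row in rows:
        pass                        # process each row

cursor.close()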

Now, the explanation for why it isn't freed, and why that isn't a memory leak in the formally correct use of that term.

Most processes don't release memory back to the OS when it's freed; they just make it available for re-use elsewhere in the program.

Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.

What usually ends up happening, unless the program takes extra care, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.

The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.

It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
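One way this is sometimes attempted from Python, on glibc-based Linux only, is asking the allocator to release free pages via malloc_trim through ctypes. This is just a sketch of the idea, with no guarantee it shrinks the process in this situation:

import ctypes
import ctypes.util

# glibc-only: malloc_trim(0) returns 1 if it was able to release memory to the OS
libc = ctypes.CDLL(ctypes.util.find_library("c"))
released = libc.malloc_trim(0)
print("memory released to the OS:", bool(released))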

So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it'll re-use the previously freed memory from the last big allocation.

Craig Ringer
  • Thanks! Confirmed that memory is reused by running the above code twice in the same process. Memory didn't increase during the second run. – Adam Berlinsky-Schine Jun 20 '13 at 20:15
  • While everything said here is right, a query result normally will be transferred entirely to the client side (not by `fetch*()` but by `execute()`). So while using `fetchmany()` instead of `fetchall()` may save some memory in terms of Python object creation, using a server-side cursor as suggested by @joeblog is the right solution. – piro Jul 02 '15 at 17:18

Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.

import psycopg2
import sys

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')
curPG.itersize = 100000 # Rows fetched at one time from the server

curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS !!
cptLigne = 0
for rec in curPG:
   cptLigne += 1
   if cptLigne % 10000 == 0:
      print('.', end='')
      sys.stdout.flush() # To see the progression
conPG.commit() # Also close the cursor
conPG.close()

As you will see, the dots appear in rapid groups, then pause while the next buffer of rows (itersize) is fetched from the server, so you don't need to use fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM either, as it uses a temporary table.

Le Droid