
I use PostgreSQL 9.4 for a model database. My table looks somewhat like this:

CREATE TABLE table1 (
    sid INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('table1_sid_seq'::regclass),
    col1 INT,
    col2 INT,
    col3 JSONB
);

My Python 2.7 workflow often looks like this:

curs.execute("SELECT sid, col1, col2 FROM table1")
data = curs.fetchall()
putback = []
for i in data: 
    result = do_something(i[1], i[2])
    putback.append((sid, result))
del data
curs.execute("UPDATE table1
              SET col3 = p.result
              FROM unnest(%s) p(sid INT, result JSONB)
              WHERE sid = p.sid", (putback,))
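
For reference, do_something(col1, col2) only has to return a JSON string for col3; a dummy stand-in along these lines (hypothetical, roughly matching the ~100 characters per row mentioned below) is enough to reproduce the pattern:

import json

def do_something(col1, col2):
    # Dummy placeholder for the real computation: returns a JSONB-compatible
    # string of roughly 100 characters.
    return json.dumps({"col1": col1, "col2": col2, "payload": "x" * 60})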

This workflow typically works quite well and efficiently. However, for large queries PostgreSQL memory use will sometimes go through the roof (>50 GB) during the UPDATE command, and I believe the backend is being killed by OS X, because I get the WARNING: terminating connection because of crash of another server process. My MacBook Pro has 16 GB of RAM, and the query in question has 11M rows with about 100 characters of data to write back per row.

My postgresql.conf:

default_statistics_target = 50
maintenance_work_mem = 512MB 
constraint_exclusion = on 
checkpoint_completion_target = 0.9
effective_cache_size = 4GB 
work_mem = 256MB 
wal_buffers = 16MB 
checkpoint_segments = 128 
shared_buffers = 1024MB 
max_connections = 80

So I wonder:

  1. Why does my query sometimes consume excessive amounts of RAM?
  2. How can I control memory use and still guarantee good performance?
  3. Is there a good guideline or tool for tuning PostgreSQL?

Update:
I am pretty sure that @wildplasser pinpointed my problem. In the comments he suggests dumping the data into the database first and unpacking it from there. Unfortunately I could not figure out how to implement his proposal; the sketch below shows what I think he means, but I have not gotten it to work end to end. If anyone has an idea how to do that, their answer will be gladly accepted.
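
Presumably it would look something like this (untested; it assumes psycopg2, an arbitrary temp table name putback_tmp, and that the JSON strings in putback contain no tab, newline or backslash characters, which COPY's text format would otherwise need escaped):

from cStringIO import StringIO  # Python 2.7

# 1. Stage the results in a temp table instead of one huge unnest() parameter.
curs.execute("""CREATE TEMP TABLE putback_tmp (
                    sid    INT PRIMARY KEY,
                    result JSONB
                ) ON COMMIT DROP""")

# 2. Bulk-load the rows with COPY rather than tuple-at-a-time INSERTs.
buf = StringIO()
for sid, result in putback:
    buf.write("%d\t%s\n" % (sid, result))
buf.seek(0)
curs.copy_from(buf, 'putback_tmp', columns=('sid', 'result'))

# 3. Update from the temp table, which the planner can join like any other table.
curs.execute("""UPDATE table1
                SET col3 = p.result
                FROM putback_tmp p
                WHERE table1.sid = p.sid""")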

  • 1) Your work_mem is (rather) high, and you (probably) have no table structure. 2) Design your database. 3) See 2. BTW: your SELECT query fetches *all* the rows (and I don't understand your UPDATE query.) – wildplasser Dec 25 '15 at 15:23
  • Yes. Your table might look like a spreadsheet, I don't know. Yes. – wildplasser Dec 25 '15 at 15:27
  • BTW: I don't understand a word of your Python stuff, but it *looks* like you are sucking the entire db-table into a Python set or array, and using that (in exploded form) to update the same table. – wildplasser Dec 25 '15 at 15:33
  • @wildplasser Yes, this is basically what I do: `unnest` throws all the data at the database. Normally this works fine, but now I have encountered a case where memory use is extremely high, I am not sure why, and I would like to control that. – n1000 Dec 25 '15 at 15:36
  • I would advise first saving "putback" into a temp table (plus: add a PK to this temp) and doing the update from there. (As I understand it, the way it works now is: the "unnest" first builds a *huge* array and then unpacks it, all in memory.) – wildplasser Dec 25 '15 at 15:50
  • No, you should *not* construct the temp table by exploding a huge array (which is very hard for the parser, if I understand your ORM correctly). Instead, explode the "array" locally and copy it to a remote table (take care not to operate on a tuple-at-a-time basis, which would take a lot of time but would not fail). – wildplasser Dec 25 '15 at 22:32
  • @wildplasser Unfortunately I couldn't figure out how to implement your suggestion. I am pretty sure your advice is what I need (see updated question and my workaround answer). Thanks anyway! – n1000 Dec 26 '15 at 13:18
  • Please add a (dummy) version for `do_something(int, int)`, so that we can test without needing to learn python and jsonb. – wildplasser Dec 26 '15 at 13:36

1 Answer


My workaround is to slice `putback` into chunks with a simple function, as proposed here:

def chunk(l, n):
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

and then

for part in chunk(putback, 250000):
    curs.execute("""UPDATE table1
                    SET col3 = p.result
                    FROM unnest(%s) p(sid INT, result JSONB)
                    WHERE sid = p.sid""", (part,))

This works, i.e. it keeps the memory footprint in check, but it is not very elegant and is slower than dumping all the data at once, as I usually do.
