
I have a very large table (3.7 GB) stored in a PostgreSQL database, which I am loading with psycopg2 using the following code:

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, database="postgres", user="postgres", password="root")

cur = conn.cursor('cursor1')
cur.itersize = 10000
cur.execute("""SELECT * FROM customers""")
query_results = cur.fetchall()
cur.close()
conn.close()

Python uses 13 GB of memory to load the table the first time, and every time I run this code again the memory usage increases until it reaches 99%; then it drops back to the 13 GB level. The query also takes longer each time: 34 seconds on the first run, 46 seconds on the second, and around ~45 seconds on subsequent runs.

I have searched for a solution and this answer seemed relevant, but setting itersize didn't work for me either; the memory usage stayed the same.

What should I do to decrease the memory usage? I tried gc.collect() and it didn't work either.

I am using Python 3.9 with PostgreSQL 14 on a Windows 10 machine, and I have 32 GB of RAM.

kron_ninja
    Do you actually, honestly need to select all rows and all columns of the table? In other words, what's your actual use case here? – AKX Jan 10 '22 at 08:39
  • If you need to do some operation for ALL the customers, then do it with some `batching logic`, e.g. customers in a certain ID range, or names starting 'A' to 'C', then 'D' to 'F', so that you _don't need_ to bring _all_ records from that large table to your client side (a rough sketch of this approach follows these comments). – Anand Sowmithiran Jan 10 '22 at 09:14
  • The above mechanism _also_ enables you to take advantage of indexes (if any), if you frame your WHERE clause appropriately. – Anand Sowmithiran Jan 10 '22 at 09:17
  • @AKX At some point I do, to make some calculations based on the whole table. But my main point is different: why does a 3.7 GB table take 13 GB of memory? I get that the program would use more than 3.7 GB, but not 13 GB. I am guessing the cursor takes the memory and then does not free it, but I'm not sure how to fix that. – kron_ninja Jan 10 '22 at 10:03
  • @AnandSowmithiran as I said, the main problem is that the table takes 4x its size in Python memory, and I want to solve that problem. – kron_ninja Jan 10 '22 at 10:04
  • you are comparing apples and oranges - I guess 3.7 GB is the disk space occupied by that table; it doesn't mean the client library that brings the data into memory will take only that amount of memory. If you need to perform calculations on the whole table, you need to look at other ways - say using data frames and `Dask`. – Anand Sowmithiran Jan 10 '22 at 10:13
  • @kron_ninja Since your data is already in a spectacularly capable RDBMS, do those full-table computations in PostgreSQL instead. – AKX Jan 10 '22 at 10:56
  • @kron_ninja Python objects, as `psycopg2`'s cursor returns by default, can have significant overhead on top of the data. For instance, the in-memory size of `l = list(range(10000))` isn't e.g. (64 bits / 8 bits per byte) * 10000 = 80 kBytes, it's `sys.getsizeof(l) + sum(sys.getsizeof(o) for o in l)`, 360 kBytes. A columnar data format, such as how Pandas stores data, could be better, but then you'll need to deal with the overhead of loading such a dataframe. – AKX Jan 10 '22 at 10:59
  • You don't say how you measured the table size in Postgres. It is also important to note that Postgres uses [TOAST](https://www.postgresql.org/docs/current/storage-toast.html) to store large values of selected types in compressed form. So what you see in the database may not equal the uncompressed size in the client. – Adrian Klaver Jan 10 '22 at 16:40
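For illustration, here is a rough sketch of the batching approach suggested in the comments above. It assumes the customers table has an integer id column and uses an arbitrary batch size; neither detail comes from the question.

import psycopg2

BATCH_SIZE = 100_000  # arbitrary; tune to available memory

conn = psycopg2.connect(host="localhost", port=5432, database="postgres",
                        user="postgres", password="root")
cur = conn.cursor()

# Assumes an integer key named "id" (not stated in the question).
cur.execute("SELECT min(id), max(id) FROM customers")
lo, hi = cur.fetchone()

for start in range(lo, hi + 1, BATCH_SIZE):
    cur.execute(
        "SELECT * FROM customers WHERE id >= %s AND id < %s",
        (start, start + BATCH_SIZE),
    )
    batch = cur.fetchall()  # only one batch is held in memory at a time
    # ... run the per-batch calculation on `batch` here ...

cur.close()
conn.close()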

1 Answer


Around a 4-fold expansion for holding the data in Pythonic form seems about right (using pgbench_accounts as the benchmark). Python manages memory automatically, and not particularly efficiently. If you want to micromanage the brains out of your memory usage, you should write in C, not Python.

Using a named cursor to read the data when you want it all stored in memory anyway is nearly pointless. Now, if you wanted to stream the data through Python, e.g. with `for row in cur`, that would save a lot of memory on its own, and using a named cursor on top of that would save even more (see the sketch below).
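For illustration, a minimal sketch of that streaming pattern: the connection parameters are copied from the question, and process_row is a hypothetical placeholder for whatever per-row work you need.

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, database="postgres",
                        user="postgres", password="root")

with conn.cursor(name="cursor1") as cur:  # named => server-side cursor
    cur.itersize = 10000                  # rows fetched per round trip to the server
    cur.execute("SELECT * FROM customers")
    for row in cur:                       # at most ~itersize rows are held client-side
        process_row(row)                  # hypothetical per-row processing

conn.close()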

Your comment about using more and more memory every time it runs doesn't make any sense. Your code only executes the SQL once. You can't expect intelligent commentary on the invisible parts of your code.

jjanes