5

I have a large table with ca. 10 million rows in a PostgreSQL 9.4 database. It looks somewhat like this:

gid | number1 | random | result | ...
 1  |    2    |  NULL  |  NULL  | ...
 2  |   15    |  NULL  |  NULL  | ...
... |   ...   |   ...  |  ...   | ...

Now I would like to update the columns random and result as a function of number1. That means that at least random needs to be produced in a script outside of the database. Since I have limited RAM, I wonder how I could do that efficiently using psycopg2. I believe I face two problems: how to fetch the data without using too much RAM, and how to get it back into the table. The simpleton approach would look like this:

curs.execute("""SELECT gid1, number1 FROM my_table;""")
data = curs.fetchall()

result = []
for i in data:
    result.append((create_random(i[1]), i[0]))
curs.executemany("""UPDATE my_table
                    SET random = %s
                    WHERE gid = %s;""",
                 results)
curs.execute("""UPDATE my_table
                SET result = number1 * random;""")

However, this will certainly deplete all my memory quickly, and issuing one UPDATE per row with executemany will take forever.
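For the fetching half I could presumably use a named (server-side) cursor, so that only one chunk of rows sits in client RAM at a time. A sketch (with con being the psycopg2 connection; cursor name and chunk size picked arbitrarily):

read_curs = con.cursor(name="read_my_table")  # named = server-side cursor
read_curs.execute("""SELECT gid, number1 FROM my_table;""")
while True:
    chunk = read_curs.fetchmany(10000)  # arbitrary chunk size
    if not chunk:
        break
    # ... generate the random values for this chunk ...
read_curs.close()

But that only solves the fetching half; I still don't see how to write the results back without millions of single-row UPDATEs.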

What would be a smarter strategy? The database is being accessed exclusively and can be locked. The PostgreSQL random function is unfortunately not suitable for my case.

n1000
  • Define “too much RAM”, because 10 million * 3 integers is 240 MB, which is more than manageable by many systems. Also, could your generating logic be implemented as a stored procedure? – spectras Oct 21 '15 at 10:16
  • Can you use [PL/Python](http://www.postgresql.org/docs/9.4/static/plpython.html)? If so you can write your own replacement for Postgres's random function and run it all as one `UPDATE` statement inside the database. – foz Oct 21 '15 at 10:40
  • @foz Oh - interesting! But I assume this wouldn't work in my case because I need the numpy package...? – n1000 Oct 21 '15 at 11:38
  • @n1000 well I haven't tried it myself, but there is an example here that imports numpy inside a PL/Python function: https://github.com/ihuston/plpython_examples/blob/master/simple_examples.sql – foz Oct 21 '15 at 14:19
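Following up on foz's PL/Python suggestion, a minimal sketch of what such a function could look like (assuming the plpython2u extension can be installed on the server; the function body is only a stand-in for the real generator, and numpy could be imported the same way if it is installed for the server's Python):

create extension if not exists plpython2u;

create or replace function create_random(n integer)
returns double precision as $$
    # this runs in the server's Python interpreter
    import random
    return random.random() * n
$$ language plpython2u;

With that in place the whole job stays inside the database:

update my_table set random = create_random(number1);
update my_table set result = number1 * random;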

1 Answer

5

unnest the data to do it all at once:

from random import random

def create_random(i):
    return random() * i

curs.execute("select gid, number from t;")
data = curs.fetchall()

results = []
for i in data:
    results.append((create_random(i[1]), i[0]))

# psycopg2 adapts the list of (rnd, gid) tuples to an array of
# records; unnest expands it back into rows to join against t.
curs.execute("""
    update t
    set
        rnd = s.rnd,
        result = number * s.rnd
    from unnest(%s) s(rnd numeric, gid integer)
    where t.gid = s.gid;
""", (results,))

con.commit()

Table t:

create table t (
    gid integer,
    number integer,
    rnd float,
    result float
);
Clodoaldo Neto
  • Wow - the `UPDATE` is 500% faster than using `executemany`. Thanks! So you think fetching all into RAM and putting it back with `unnest` is OK otherwise? I'm just afraid I might run out of RAM for larger tables and more columns. No need for named cursors or something like that? – n1000 Oct 22 '15 at 18:36
  • @n1000 It is OK as long as you think it is OK to take that client/server round trip. A PL/Python function would make it much simpler than dealing with cursors and client/server round trips. – Clodoaldo Neto Oct 22 '15 at 19:55
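For completeness, the two ideas from this comment thread can be combined: stream the rows through a named (server-side) cursor and run one unnest update per chunk, keeping client RAM bounded. A rough, untested sketch against the table t above (chunk size arbitrary; the cursor keeps the snapshot taken when it was opened, so it should not see its own updates):

from random import random

read = con.cursor(name="read_t")  # rows stay on the server
read.execute("select gid, number from t;")

write = con.cursor()
while True:
    rows = read.fetchmany(10000)  # arbitrary chunk size
    if not rows:
        break
    results = [(random() * number, gid) for gid, number in rows]
    write.execute("""
        update t
        set
            rnd = s.rnd,
            result = number * s.rnd
        from unnest(%s) s(rnd numeric, gid integer)
        where t.gid = s.gid;
    """, (results,))

read.close()
con.commit()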