
There are multiple tables containing more than 1 million records, and I have to do something like:

for record in GiantTable.objects.all():
    doSomethingAwesome(record)

I have tried:

  1. The Django Paginator, but it makes multiple calls to the DB.
  2. iterator(), but it didn't really help with either time or memory (roughly as in the sketch after this list).
  3. Offset and limit slicing, as suggested here, but again it makes multiple DB calls.
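
For reference, the iterator() attempt in item 2 looked roughly like this (just a sketch; chunk_size is only available on newer Django versions, and GiantTable / doSomethingAwesome are the placeholders from above):

# Rough sketch of the iterator() attempt from item 2, for context only.
# chunk_size exists only on newer Django versions.
for record in GiantTable.objects.all().iterator(chunk_size=2000):
    doSomethingAwesome(record)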

Finally, I ended up writing a function in PostgreSQL that opens a cursor:

CREATE FUNCTION reffunc(refcursor) RETURNS refcursor AS $$
BEGIN
    OPEN $1 FOR SELECT * FROM giantTable;
    RETURN $1;
END;
$$ LANGUAGE plpgsql;

as documented in psycopg2, and I call this cursor like so:

import psycopg2

conn = psycopg2.connect("dbname=test user=root")
cur1 = conn.cursor()
cur1.callproc('reffunc', ['curname'])
cur2 = conn.cursor('curname')
for rec in cur2.fetchall():
    doSomethingAwesome(rec)
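
If I understand the psycopg2 docs correctly, a named cursor can apparently also be created directly, without the plpgsql wrapper; this is only a sketch (the cursor name 'curname2' is made up):

import psycopg2

# Sketch of the direct named-cursor approach described in the psycopg2 docs
# (untested here): a named cursor is server-side, so iterating it fetches
# itersize rows per network round trip instead of loading everything at once.
conn = psycopg2.connect("dbname=test user=root")
cur = conn.cursor('curname2')   # the name 'curname2' is arbitrary
cur.itersize = 2000             # rows transferred per round trip
cur.execute("SELECT * FROM giantTable")
for rec in cur:
    doSomethingAwesome(rec)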

I have two questions here:

  1. As documented in psycopg2,

    the default value of 2000 allows to fetch about 100KB per roundtrip assuming records of 10-20 columns of mixed number and strings

    but does this mean that PostgreSQL hits the DB again after every 2000 records? How does it return the results? When I did len(cur2.fetchall()) it took a while and returned the total number of records. Should I instead write another function in PostgreSQL, FETCH 2000 FROM gigantic_cursor;, and call that from Django after every 2000 records (see the fetchmany() sketch after this list)?

  2. Should I go with this approach and do the same for all the other tables? Is this the most optimal way to reduce both memory and time, or have I misunderstood something?
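
For question 1, the batching I have in mind would look something like this (just a sketch; cur2 is the named cursor from above, and I am assuming fetchmany() issues the FETCH behind the scenes so I don't need another plpgsql function):

# Sketch for question 1: pull the refcursor's rows in batches of 2000 via
# fetchmany() instead of fetchall() or a separate FETCH function.
while True:
    records = cur2.fetchmany(2000)   # one round trip per batch
    if not records:
        break
    for rec in records:
        doSomethingAwesome(rec)
cur2.close()
conn.close()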
