There are multiple tables containing more than 1 million records, and I have to do something like:
for record in GiantTable.objects.all():
    doSomethingAwesome(record)
I have tried:
- Django's Paginator, but again it makes multiple calls to the DB.
- QuerySet's iterator(), but it didn't really help with either time or memory (roughly as in the sketch below).
- Offset and limit slicing, as suggested here, but again it makes multiple DB calls.
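For reference, the iterator attempt was roughly this (a sketch, not my exact code):

for record in GiantTable.objects.all().iterator():
    doSomethingAwesome(record)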
Finally I ended up writing a function and creating a cursor in PostgreSQL:
CREATE FUNCTION reffunc(refcursor) RETURNS refcursor AS $$
BEGIN
    OPEN $1 FOR SELECT * FROM giantTable;
    RETURN $1;
END;
$$ LANGUAGE plpgsql;
as documented in psycopg2, and I call this cursor as:
import psycopg2
conn = psycopg2.connect("dbname=test user=root")
cur1 = conn.cursor()
cur1.callproc('reffunc', ['curname'])
cur2 = conn.cursor('curname')
for rec in cur2.fetchall():
    doSomethingAwesome(rec)
I have two questions here:
As documented in psycopg2,
the default value of 2000 allows to fetch about 100KB per roundtrip assuming records of 10-20 columns of mixed number and strings
but does this mean PostgreSQL hits the DB after every 2000 records? How does it return the results? Because when I did
len(cur2.fetchall())
it took time and returned the total number of records. Should I write another function in PostgreSQL,
FETCH 2000 FROM gigantic_cursor;
and call this function from Django after every 2000 records? Should I go with this approach and do the same for all the other tables? Is this the most optimal way to reduce both memory and time, or have I misunderstood something?
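To make the second question concrete, this is roughly the batched loop I have in mind (a sketch only, using the standard DB-API fetchmany() on the named cursor, with a batch size of 2000 to mirror the documented itersize default):

import psycopg2

conn = psycopg2.connect("dbname=test user=root")
cur1 = conn.cursor()
cur1.callproc('reffunc', ['curname'])

cur2 = conn.cursor('curname')        # named (server-side) cursor
while True:
    rows = cur2.fetchmany(2000)      # pull the next 2000 rows instead of fetchall()
    if not rows:
        break
    for rec in rows:
        doSomethingAwesome(rec)

Is this what the itersize setting is already doing for me when I iterate the named cursor, or do I have to batch it manually like this?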