There's nothing inherently wrong with what you're doing. If memory usage is growing with the size of the result set, then one of a few things is happening:
- You're leaking references to the rows you receive, for example by accumulating them in a list somewhere (see the sketch below). I suspect you'd know if you were doing that.
- The database bindings or underlying library isn't freeing memory from previously-read rows as it reads new ones from the query. That's usually a bug. It can legitimately happen if you have debugging features turned on, but it shouldn't happen by default.
Note that the underlying library may cache a certain amount of data, and you may see a significant amount of memory use as a result, but unless something is catastrophically wrong with the configuration it shouldn't be 3GB.
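For illustration, here's a minimal sketch of the difference (it assumes a table like the test table created in the example further down). The first version holds every row in memory at once; the second keeps only one row alive at a time:

def read_all_at_once(conn):
    c = conn.cursor()
    c.execute('select * from test')
    return c.fetchall()  # materializes every row in a Python list at once

def read_streaming(conn):
    c = conn.cursor()
    c.execute('select * from test')
    for row in c:  # the cursor is an iterator; rows are fetched one at a time
        pass       # process and discard each row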
Here's some simple SQLite code that reproduces what you're doing. When run, it creates a trivial table of fifteen million rows, about 180 MB on disk for the version I'm using. It then SELECTs all of that data, throwing away the results, and sleeps so you can examine the process's memory use. On my system, the resulting process uses only 15 MB.
(Note that I ran the create_db and read_db passes with separate invocations; creating the database takes some time.)
SQLite can handle this, and any production server-backed database like MySQL or PostgreSQL should be able to as well. SELECT results are a data stream, and databases should be able to handle streams of unlimited size easily.
import sqlite3
import time

def create_db(conn):
    c = conn.cursor()
    c.execute('create table test (i integer)')
    conn.commit()

    max_val = 15000000
    chunk = 1000000
    for start in xrange(0, max_val, chunk):
        print "%i ..." % start
        for i in xrange(start, start + chunk):
            c = conn.cursor()
            c.execute('insert into test (i) values (?)', (i,))
        # Commit once per chunk of a million inserts:
        conn.commit()

def read_db(conn):
    c = conn.cursor()
    c.execute('select * from test')
    # Fetch every row one at a time, discarding each as we go:
    for x in xrange(15000000):
        c.fetchone()
    print "Done"

    # Sleep forever, to examine memory usage:
    while True:
        time.sleep(1)

def go():
    conn = sqlite3.connect('test.db')

    # Pick one:
    create_db(conn)
    # read_db(conn)

if __name__ == "__main__":
    go()
This doesn't answer your question directly, but I wanted to make it clear that there's nothing wrong with what you're doing: you shouldn't need to chunk queries manually, though in the end that's probably the workaround you'll need.
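If you do end up chunking, here's a rough sketch of one way to do it against the same test table, paging on the i column with a moving lower bound rather than ever-larger OFFSETs (the chunk size is an arbitrary choice for illustration):

def read_in_chunks(conn, chunk_size=100000):
    # Fetch the table in bounded chunks, resuming after the last value
    # seen, so only one chunk is ever held in memory at a time.
    last = -1
    while True:
        c = conn.cursor()
        c.execute('select i from test where i > ? order by i limit ?',
                  (last, chunk_size))
        rows = c.fetchall()
        if not rows:
            break
        for row in rows:
            pass  # process each row here
        last = rows[-1][0]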