
I am trying to iterate over a really large table.

iter(self.session.query(Trade).order_by(Trade.time).yield_per(1))

or

for t in self.session.query(Trade).order_by(Trade.time).yield_per(1):
    prev = t

blocks until the machine runs out of memory and crashes. yield_per(1) doesn't seem to do anything. How can I iterate over a large table without loading it all into memory at once?

ni9e
  • 1. Can you also mention the DB you're using? 2. Does it crash when you run it exactly like your second example (with only `pass` in the body), or do you have some other code in the body? – Yaroslav Admin Oct 25 '15 at 21:01
  • I am using MySQL. I tried to run my example, and it doesn't consume memory with only `pass` in the body, but if I replace `pass` with `prev = t`, it starts to consume memory up to 100%. – ni9e Oct 25 '15 at 21:07
  • Actually, it still does with `pass` in the body of the for loop. I just got a little bit confused, because it takes some time before the memory consumption starts. – ni9e Oct 25 '15 at 21:13
  • See Non Window Function Version in the [SQLAlchemy wiki](https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/WindowedRangeQuery). – Yaroslav Admin Oct 26 '15 at 01:05
  • Without looking into details: even if you read data from the database in chunks, won't all the objects still end up in the session anyway, and as such keep consuming memory? In this case you could use [Expunging](http://docs.sqlalchemy.org/en/rel_1_0/orm/session_state_management.html#expunging) to remove objects from memory. To test this, try yielding per larger chunk (1000), but call `session.expunge_all()` after each 1000 iterations and see if memory is released back. – van Oct 26 '15 at 07:42
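
Following up on the last comment, here is a rough, untested sketch of that chunk-and-expunge test (the `Trade` and `session` names come from the question; `iter_trades_chunked` is just a name made up for illustration). Note that `yield_per()` only bounds memory if the database driver streams results; the default MySQL DB-API cursors buffer the whole result set client-side, which may be why `yield_per(1)` appeared to do nothing.

def iter_trades_chunked(self, chunksize=1000):
    # Ask the ORM to fetch rows in batches of `chunksize` instead of all at once.
    query = self.session.query(Trade).order_by(Trade.time).yield_per(chunksize)
    for i, trade in enumerate(query, 1):
        yield trade
        if i % chunksize == 0:
            # Detach already-yielded objects from the session so they can be
            # garbage-collected instead of piling up in the identity map.
            self.session.expunge_all()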

1 Answer


I'm answering my own question. The 'WindowedRangeQuery' recipe on the SQLAlchemy wiki (link provided by @Yaroslav Admin) would be ideal, but MySQL apparently lacks window functions. The 'Non Window Function Version' described on the same wiki page scales better, but makes too many assumptions about the data.

This solution works, but it gets slower with each page because the OFFSET keeps growing:

def get_all_trades(self):
    pagesize = 1000000
    # Total row count tells us when to stop paging.
    row_count = self.session.query(Trade).count()
    offset = 0
    while offset < row_count:
        # slice() is rendered as LIMIT/OFFSET, so only one page is loaded at a time.
        query = self.session.query(Trade). \
                order_by(Trade.time). \
                slice(offset, offset + pagesize)
        offset += pagesize
        for trade in query:
            yield trade
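
As a quick usage sketch (assuming `dao` is an instance of whatever class defines `get_all_trades()`), the generator can be consumed like the original loop without materializing the whole table:

prev = None
for trade in dao.get_all_trades():
    prev = trade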

I'm not really happy with this solution. Maybe I should switch to PostgreSQL to get those window functions going.
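
For reference, a stripped-down, untested sketch of what the window-function version from the wiki recipe could look like on PostgreSQL (`window_ranges` and `get_all_trades_windowed` are names made up here):

from sqlalchemy import and_, func

def window_ranges(session, column, windowsize):
    # Number every row by `column`, then keep every windowsize-th value as a
    # window boundary (requires window functions, e.g. PostgreSQL).
    subq = session.query(
        column,
        func.row_number().over(order_by=column).label("rownum")
    ).subquery()
    boundaries = [value for value, in
                  session.query(subq.c[column.name])
                         .filter(subq.c.rownum % windowsize == 1)]
    for lo, hi in zip(boundaries, boundaries[1:] + [None]):
        yield and_(column >= lo, column < hi) if hi is not None else column >= lo

def get_all_trades_windowed(self, windowsize=100000):
    for whereclause in window_ranges(self.session, Trade.time, windowsize):
        # Each window is an independent bounded query, so no OFFSET ever grows.
        for trade in self.session.query(Trade).filter(whereclause).order_by(Trade.time):
            yield trade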

ni9e