48

I am currently playing around with SQLAlchemy a bit, which is really quite neat.

For testing I created a huge table containing my picture archive, indexed by SHA1 hashes (to remove duplicates :-)), which was impressively fast...

For fun I did the equivalent of a select * over the resulting SQLite database:

session = Session()
for p in session.query(Picture):
    print(p)

I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage skyrocketed, reaching 1 GB after a few seconds. This seems to come from SQLAlchemy's identity map, which I thought kept only weak references.

Can somebody explain this to me? I thought each Picture p would be garbage-collected after its hash is written out.

Bluehorn
  • Noticed this same problem myself. If I do `len(Session.query(Model).limit(100).all())`, I get `1`. If I remove the `limit`, memory use skyrockets. – Sarah Vessels Aug 03 '11 at 20:47

3 Answers

62

Okay, I just found a way to do this myself. Changing the code to

session = Session()
for p in session.query(Picture).yield_per(5):
    print(p)

loads only 5 pictures at a time. It seems the query loads all rows at once by default. However, I don't yet understand the disclaimer on that method. Quoting the SQLAlchemy docs:

WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten. In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.

So if using yield_per is actually the right way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
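
Going by that disclaimer, a minimal sketch of a case that should be unproblematic is a query with no eagerly loaded collections, where each result row corresponds to exactly one yielded value. The `sha1` column name below is an assumption; the real mapping isn't shown above:

# Sketch only -- assumes a `sha1` column on Picture and no eager loads.
session = Session()

# Selecting just the column keeps one result row per yielded value,
# which is the situation the warning above does not apply to.
for (sha1,) in session.query(Picture.sha1).yield_per(1000):
    print(sha1)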

Bluehorn
  • yield_per is safe when you have one result SQL row per resulting instance. You get extra rows per instance when you eagerload or join a one-to-many relation. If you need more details about the circumstances, you could create a separate question about yield_per. – Ants Aasma Jul 28 '09 at 14:48
  • Something to note: This throws an exception if you try to commit while doing this. See http://stackoverflow.com/questions/12233115/sqlalchemy-cursor-error-during-yield-per – Theron Luhn Feb 02 '13 at 21:26
  • After a long battle with memory leaks in MySQL I saw this in the yield_per docs: `Also note that while yield_per() will set the stream_results execution option to True, currently this is only understood by psycopg2 dialect which will stream results using server side cursors instead of pre-buffer all rows for this query. Other DBAPIs pre-buffer all rows before making them available.` Workaround here: http://stackoverflow.com/a/3699677/281469 – bcoughlan Dec 10 '13 at 15:13
37

Here's what I usually do for this situation:

def page_query(q):
    offset = 0
    while True:
        r = False
        for elem in q.limit(1000).offset(offset):
            r = True
            yield elem
        offset += 1000
        if not r:
            break

for item in page_query(Session.query(Picture)):
    print(item)

This also avoids the buffering that the various DBAPIs do (such as psycopg2 and MySQLdb). It still needs to be used appropriately if your query has explicit JOINs, although eagerly loaded collections are guaranteed to load fully since they are applied to a subquery which has the actual LIMIT/OFFSET supplied.

I have noticed that PostgreSQL takes almost as long to return the last 100 rows of a large result set as it does to return the entire result (minus the actual row-fetching overhead), since OFFSET just does a simple scan of the whole thing.
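
A variation on the same idea, which sidesteps the OFFSET scan entirely, is to paginate on a key instead. This is only a sketch and assumes the table has a monotonically increasing integer primary key (`id` below), which the question does not show:

# Sketch of keyset ("seek") pagination -- assumes Picture has an integer
# primary key column `id`.
def keyset_query(session, page_size=1000):
    last_id = 0
    while True:
        batch = (session.query(Picture)
                 .filter(Picture.id > last_id)
                 .order_by(Picture.id)
                 .limit(page_size)
                 .all())
        if not batch:
            break
        for elem in batch:
            yield elem
        # Continue from the largest id seen so far.
        last_id = batch[-1].id

for item in keyset_query(Session()):
    print(item)

Since the WHERE clause can use the primary key index, each batch only touches the rows it returns instead of re-scanning everything before the offset.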

zzzeek
  • I think you can get rid of one assignment in the loop, like: while True: elem = None; for elem .... offset += 1000; if elem==None: break; offset += 1000 – Dave Jan 06 '11 at 17:02
  • No, Dave, you need to set `elem = None` at the end of the loop or it will never terminate. – George V. Reilly Jul 14 '12 at 02:18
  • Note that slicing the query will give you the same effect: `q[offset:offset+1000]`. – Martijn Pieters Aug 24 '12 at 16:01
  • see also @zzzeek 's [more recent answer on a similar subject](http://stackoverflow.com/a/7390660/884640) ... linking to [more info and code on wiki](http://www.sqlalchemy.org/trac/wiki/UsageRecipes/WindowedRangeQuery) – floer32 Jun 28 '16 at 00:07
9

You can defer the picture column so that it is only retrieved on access. You can do this on a query-by-query basis, like

import sqlalchemy.orm

session = Session()
for p in session.query(Picture).options(sqlalchemy.orm.defer("picture")):
    print(p)

or you can do it in the mapper

mapper(Picture, pictures, properties={
   'picture': deferred(pictures.c.picture)
})

How to do this is covered in the documentation here.

Doing it either way will make sure that the picture is only loaded when you access the attribute.
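
If you use the declarative extension rather than a classical mapper(), the equivalent looks roughly like the following. This is only a sketch; the table name, column names, and types are assumptions, not taken from the original mapping:

from sqlalchemy import Column, Integer, LargeBinary, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import deferred

Base = declarative_base()

class Picture(Base):
    __tablename__ = 'pictures'  # assumed table name

    id = Column(Integer, primary_key=True)
    sha1 = Column(String(40), index=True)
    # deferred(): the blob is left out of the initial SELECT and fetched
    # in a separate statement the first time .picture is accessed.
    picture = deferred(Column(LargeBinary))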

David Raznick
  • Thanks David, that is quite interesting. In fact I just changed my code to use `from sqlalchemy.orm import deferred` ... `class Picture(object): ... imagedata = deferred(Column(Binary))`, which is a big improvement. However, it still takes a few seconds until the first result is output, as a few thousand Picture objects are created for the query. I would like those objects to be created one at a time while SQLAlchemy iterates through the SQLite result rows. – Bluehorn Jul 18 '09 at 08:23
  • I think that most of the time will be spent retrieving the data from the database, not making the objects, so the query itself needs limits on it, e.g. `for p in session.query(Picture)[0:5]: print(p)`. – David Raznick Jul 18 '09 at 18:38