To cite the SQLAlchemy documentation:
The Query.yield_per() method is not compatible with most eager loading schemes, including subqueryload and joinedload with collections.
Warning
Use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.
In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=’joined’ or ‘subquery’) since those collections will be cleared for a new load when encountered in a subsequent result batch. In the case of ‘subquery’ loading, the full result for all rows is fetched which generally defeats the purpose of yield_per().
Also note that while yield_per() will set the stream_results execution option to True, currently this is only understood by psycopg2 dialect which will stream results using server side cursors instead of pre-buffer all rows for this query. Other DBAPIs pre-buffer all rows before making them available. The memory use of raw database rows is much less than that of an ORM-mapped object, but should still be taken into consideration when benchmarking.
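For context, here is the basic pattern I am asking about. This is only a minimal sketch of my setup; MyModel, the connection string and do_something() are placeholders for my real code:

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import sessionmaker
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class MyModel(Base):
        __tablename__ = 'my_model'
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine('postgresql+psycopg2://user:password@localhost/mydb')
    Session = sessionmaker(bind=engine)
    session = Session()

    # iterate over a huge table in batches of 1000 ORM objects; yield_per()
    # sets stream_results=True, so with psycopg2 a server-side cursor is used
    for obj in session.query(MyModel).yield_per(1000):
        do_something(obj)  # read-only processing, e.g. for statistics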
I really have a problem understanding how yield_per() works and what exactly the problem with using this method is. Also, what is the right way to work around these problems and still use this function to iterate over a huge number of rows?
I'm interested in all constructive information you have, but here are some hint questions:
- How can there be multiple instances of the same row? Only through relationships (if two rows of the table being iterated have an FK to the same row in another table)? Is there a problem if you do not know that it happens, or if you only read attributes on the relationships? (See the first sketch after this list.)
- lazy='joined' or 'subquery' are not possible, but why exactly? Both of them are simply parts of the query on which you call yield_per().
- If they are cleared in a subsequent result batch, then simply load them again. So where is the problem? Or is the only problem that you lose the changes you have made to your relationships?
- In the case of 'subquery' loading, why are all rows fetched? The SQL server may have to keep a big result set around, but why not simply return that result in batches one after the other for the entire query?
- In an example in the yield_per() doc, q = sess.query(Object).yield_per(100).options(lazyload('*'), joinedload(Object.some_related)), they deactivate eager loading with lazyload('*') but keep a single joined load. Is there a way to still use yield_per() with eager loading? What are the conditions?
- They say psycopg2 is the only DBAPI which supports streamed results. So is that the only DBAPI you can use with yield_per()? As far as I understand, yield_per() uses the DBAPI's cursor.fetchmany() function (example), which many of them support. And as far as I understand, cursor.fetchmany() supports fetching only part of the result and does not fetch everything (if it fetched everything, why would the function exist?). (See the second sketch after this list.)
- I have the feeling that yield_per() is entirely safe (even with eager loading) if you only do read access (for example for statistics). Is that correct?
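To make the first hint question more concrete, here is the kind of situation I mean. Parent and Child are made-up names, and this reuses Base and session from the snippet above:

    from sqlalchemy import Column, Integer, ForeignKey
    from sqlalchemy.orm import relationship, joinedload

    class Parent(Base):
        __tablename__ = 'parent'
        id = Column(Integer, primary_key=True)
        children = relationship('Child', back_populates='parent')

    class Child(Base):
        __tablename__ = 'child'
        id = Column(Integer, primary_key=True)
        parent_id = Column(Integer, ForeignKey('parent.id'))
        parent = relationship('Parent', back_populates='children')

    # many Child rows can point to the same Parent row, so the same Parent
    # instance can show up in more than one batch of 2 rows
    for child in session.query(Child).options(joinedload(Child.parent)).yield_per(2):
        print(child.id, child.parent.id)

Is this many-to-one case already dangerous, or is the warning only about eagerly loaded collections like Parent.children?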
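And for the DBAPI question, this is what I imagine yield_per() roughly translates to at the cursor level. A pure illustration with psycopg2; the connection string and process() are placeholders:

    import psycopg2

    conn = psycopg2.connect('dbname=mydb user=me')
    # a named cursor is psycopg2's server-side cursor: rows are streamed
    # from the server instead of being pre-buffered on the client
    cur = conn.cursor(name='my_server_side_cursor')
    cur.execute('SELECT id, name FROM my_model')

    while True:
        rows = cur.fetchmany(1000)  # fetch only the next batch
        if not rows:
            break
        for row in rows:
            process(row)

    cur.close()
    conn.close()

With a regular (unnamed) cursor the whole result set is transferred to the client up front and fetchmany() only slices the already-buffered rows, which is what I understand "pre-buffer all rows" to mean.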