105

I have a ~10M record MySQL table that I interface with using SqlAlchemy. I have found that queries on large subsets of this table will consume too much memory even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:

for thing in session.query(Things):
    analyze(thing)

To avoid this, I find I have to build my own iterator that bites off chunks:

lastThingID = None
while True:
    chunk_query = query.order_by(Thing.id.desc())
    if lastThingID is not None:
        chunk_query = chunk_query.filter(Thing.id < lastThingID)
    things = chunk_query.limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)

Is this normal or is there something I'm missing regarding SA built-in generators?

The answer to this question seems to indicate that the memory consumption is not to be expected.

Paul

7 Answers

129

Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.

But then, the way Query works is that it fully loads the given result set by default before returning your objects to you. The rationale here regards queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned; otherwise collections and such might be only partially populated.
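
To make that concrete, here is a minimal sketch with a hypothetical Parent/Child mapping (the names are invented for illustration, not taken from the question) where handing objects out early would be wrong:

from sqlalchemy.orm import joinedload

# With joined eager loading, the raw result has one row per (parent, child)
# pair; a Parent seen on row N may gain more children on rows N+1, N+2, ...,
# so yielding it immediately would expose a partially filled collection.
parents = session.query(Parent).options(joinedload(Parent.children)).all()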

So Query offers an option to change this behavior through yield_per(). This call will cause the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections, so it's basically for cases where you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead, so the approach only scales slightly better than not using it.
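
For reference, a minimal sketch of what that looks like (Thing, the batch size and analyze() are stand-ins, not from the original post):

for thing in session.query(Thing).yield_per(1000):
    # Rows are turned into ORM objects 1000 at a time, so only one batch of
    # objects is held at once - subject to whatever buffering the underlying
    # DBAPI does, as noted above.
    analyze(thing)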

I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above, using window functions. LIMIT and OFFSET have a huge problem: very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of once, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.

The window function approach is on the wiki and I use it with great success.
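
For readers who don't want to follow the link, here is a rough sketch of the idea behind that recipe (not the recipe itself; Thing, its indexed id column and the window size are invented for illustration): use row_number() to pick every Nth key as a window boundary, then emit one plain range-filtered SELECT per window.

from sqlalchemy import func

def window_ranges(session, column, window_size):
    # Every window_size-th value of the ordered column becomes a boundary.
    rownum = func.row_number().over(order_by=column).label("rownum")
    subq = session.query(column.label("id"), rownum).subquery()
    ids = [row.id for row in
           session.query(subq.c.id).filter(subq.c.rownum % window_size == 1)]
    ids.append(None)  # open-ended final window
    for start, end in zip(ids[:-1], ids[1:]):
        yield start, end

def windowed_query(session, entity, column, window_size):
    # One cheap, index-friendly range SELECT per window.
    for start, end in window_ranges(session, column, window_size):
        q = session.query(entity).filter(column >= start)
        if end is not None:
            q = q.filter(column < end)
        yield from q.order_by(column)

# usage:
# for thing in windowed_query(session, Thing, Thing.id, 10000):
#     analyze(thing)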

Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.

zzzeek
  • You mention Query instantiates everything to compare identities. Could this be avoided by sorting on the primary key, and only comparing consecutive results? – Tobu Nov 25 '12 at 15:34
  • the issue is if you yield an instance with identity X, the application gets a hold of it, and then makes decisions based on this entity, and maybe even mutates it. Later, perhaps (actually usually) even on the very next row, the same identity comes back in the result, perhaps to add more contents to its collections. The application therefore received the object in an incomplete state. sorting doesn't help here because the biggest issue is the workings of eager loading - both "joined" and "subquery" loading have different issues. – zzzeek Nov 25 '12 at 19:21
  • I understood the "next row updates the collections" thing, in which case you only need to look ahead by one db row to know when the collections are complete. The implementation of eager loading would have to cooperate with the sort, so that collection updates are always done on adjacent rows. – Tobu Nov 25 '12 at 22:11
  • the yield_per() option is always there for when you're confident the query you're emitting is compatible with delivering partial result sets. I spent a marathon several-days session trying to enable this behavior in all cases; there were always obscure edge cases (obscure, that is, until your program hits one of them) that failed. In particular, relying upon ordering can't be assumed. As always, actual code contributions are welcome. – zzzeek Nov 26 '12 at 14:39
  • I'm sorry for heckling you. I just wanted to know more on what made this difficult. – Tobu Nov 26 '12 at 14:46
  • it was mostly a question of how complicated it had to get to encompass every possible use case, including when the same group of classes might be encountered multiple, disconnected times in a result, as well as how to do the abovementioned "check for the end of the joined result" logic, which would have to work recursively. Once you make the behavior implicit, then you're on the hook for all the bugs going forward. If I had the resources of a team of developers behind me, maybe it's something supportable, but it was getting to be extremely complex for not much of a benefit. – zzzeek Nov 26 '12 at 18:31
  • and sqlalchemy already has a huge amount of complexity I'm constantly trying to reduce, while maintaining the behavioral contract. – zzzeek Nov 26 '12 at 18:32
  • @zzzeek How would you suggest applying the WindowedRangeQuery usage recipe on a very large, filtered data set with no natural column to pull ranges from? We're digging deep into our database to search for "needle in the haystack" kinds of rows. Only in this case our haystack is so large that we exceed 800k rows returned and the usual OFFSET/LIMIT approach slows to a crawl. I attempted using ROW NUMBER on the entire data set to return ranges of rows to return, but quickly realized my mistake. Wish there was an easier way to return paginated results from a database... – Vijchti Apr 15 '14 at 00:37
  • virtually any relational table has some kind of unique candidate key, consisting of some group of columns. if the table truly has fully duplicate rows then it's not clear what the meaning is of pure dupes and why you'd have those. – zzzeek Apr 15 '14 at 05:47
  • @zzzeek thanks for your explanation. The suggested window-function approach may not work for me because while paginating through the result some rows may change (between two SELECTs), but I need the data as it was when the first SELECT ran. I can't load the whole result into memory because it is too big. So I am actually testing yield_per, and memory looks like it remains under control. I want to use the yield_per function and am trying to understand if it is suitable for my case. Is there any example with a more detailed explanation of when and why joins or subqueries become an issue? – schatten Oct 02 '14 at 01:54
  • @zzzeek I have read this thread and the manual page a couple of times, but still can't understand when and why a query may not be a good fit for yield_per. Thanks. – schatten Oct 02 '14 at 01:55
  • Since I am using Postgres, it looks like it is possible to use a Repeatable Read read-only transaction and run all the windowed queries in that transaction. – schatten Oct 02 '14 at 02:10
  • @zzzeek In the WindowedRangeQuery usage recipe answer you linked, what is the double modulo (%%) operator? I can't find anything on it, is it a typo? – cozos Apr 03 '18 at 22:12
  • Even in SQLAlchemy Core 1.1.x, ResultProxy seems to try to fetch every row (and it's not an iterator either). I am currently falling back to the raw db-api cursor object to iterate over both cx_Oracle and psycopg2 vanilla, non-serverside cursors. So the behavior is clearly not the DB-API implementation's behavior. It's a fair tradeoff between code complexity (I can no longer use SQLAlchemy for dialect abstraction) and memory use. – cowbert Aug 20 '18 at 21:51
  • I must admit, coming from an Oracle / C background I found the first paragraph very surprising at first. I'd assumed most / all DBMS were capable of transparently paginating their query results. – Philip Couling Jul 27 '21 at 23:13
49

I am not a database expert, but when using SQLAlchemy as a simple Python abstraction layer (i.e., not using the ORM Query object) I've come up with a satisfying solution to query a 300M-row table without exploding memory usage...

Here is a dummy example:

from sqlalchemy import create_engine, select

conn = create_engine("DB URL...").connect()
q = select([huge_table])

proxy = conn.execution_options(stream_results=True).execute(q)

Then, I use the SQLAlchemy fetchmany() method to iterate over the results in an infinite while loop:

while 'batch not empty':  # equivalent of 'while True', but clearer
    batch = proxy.fetchmany(100000)  # 100,000 rows at a time

    if not batch:
        break

    for row in batch:
        # Do your stuff here...

proxy.close()

This method allowed me to do all kinds of data aggregation without any dangerous memory overhead.

NOTE: stream_results works with Postgres and the psycopg2 adapter, but I guess it won't work with every DBAPI, nor with every database driver...

There is an interesting usecase in this blog post that inspired my above method.

edthrn
  • If one is working on postgres or mysql (with `pymysql`), this should be the accepted answer IMHO. – Yuki Inoue Apr 10 '19 at 11:15
  • Saved my life, was seeing my queries running slower and slower. I've instrumented the above on pyodbc (from sql server to postgres) and it's running like a dream. – Ed Baker Sep 13 '19 at 01:37
  • This was the best approach for me. As I am using the ORM, I needed to compile the SQL to my dialect (Postgres) and then execute directly from the connection (not from the session) as shown above. The compile "how to" I found in this other question: https://stackoverflow.com/questions/4617291/. The improvement in speed was big. Changing from JOINs to subqueries was a big increase in performance too. I also recommend sqlalchemy_mixins; using smart_query helped a lot to build the most efficient query. https://github.com/absent1706/sqlalchemy-mixins – Gustavo Gonçalves Dec 01 '19 at 15:34
16

I've been looking into efficient traversal/paging with SQLAlchemy and would like to update this answer.

I think you can use the slice call to properly limit the scope of a query and you could efficiently reuse it.

Example:

window_size = 10  # or whatever limit you like
window_idx = 0
while True:
    start, stop = window_size * window_idx, window_size * (window_idx + 1)
    things = query.slice(start, stop).all()
    if not things:
        break
    for thing in things:
        analyze(thing)
    if len(things) < window_size:
        break
    window_idx += 1

Joel
  • This seems very simple and fast. I'm not sure the `.all()` is necessary. I noticed the speed improved a lot after the 1st call. – hamx0r Mar 09 '15 at 22:39
  • @hamx0r I realize this is an old comment so just leaving it for posterity. Without `.all()` the things variable is a query that doesn't support len() – David Jul 31 '19 at 19:13
  • Under the hood this generates queries of the form `SELECT * FROM tbl LIMIT x OFFSET y`, so for large resultsets it will become slower as the offset size increases, as described in the accepted answer. – snakecharmerb Aug 25 '22 at 20:14
10

In the spirit of Joel's answer, I use the following:

WINDOW_SIZE = 1000
def qgen(query):
    start = 0
    while True:
        stop = start + WINDOW_SIZE
        things = query.slice(start, stop).all()
        if len(things) == 0:
            break
        for thing in things:
            yield thing
        start += WINDOW_SIZE

Pietro Battiston
4

Using LIMIT/OFFSET is bad, because the database has to read past all {OFFSET} rows first, so the larger the OFFSET, the slower the query gets. Using a windowed query also gave me bad results on a large table with a large amount of data (you wait too long for the first results, which is not good in my case of a chunked web response).

The best approach is given here: https://stackoverflow.com/a/27169302/450103. In my case I solved the problem simply by using an index on a datetime field and fetching the next query with datetime >= previous_datetime. Stupid, because I had already used that index in other cases, but thought that a windowed query would be better for fetching all the data. In my case I was wrong.
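
A hedged sketch of that keyset idea (Thing, its indexed created_at column and the batch size are made-up names, not from this answer):

def iterate_by_datetime(session, batch_size=10000):
    last_seen = None
    while True:
        q = session.query(Thing).order_by(Thing.created_at)
        if last_seen is not None:
            # Strictly greater-than; if many rows can share the same
            # timestamp, add a secondary key (e.g. the id) so rows are
            # neither skipped nor repeated.
            q = q.filter(Thing.created_at > last_seen)
        batch = q.limit(batch_size).all()
        if not batch:
            break
        for thing in batch:
            yield thing
        last_seen = batch[-1].created_at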

Victor Gavro
2

AFAIK, the first variant still fetches all the tuples from the table (with one SQL query) but builds the ORM representation of each entity only while iterating. So it is more efficient than building a list of all entities before iterating, but you still have to fetch all the (raw) data into memory.

Thus, using LIMIT on huge tables sounds like a good idea to me.

Pankrat
0

If you're working with Postgres or an RDBMS that supports cursors, it is very simple to iterate efficiently over a lot of rows:

from sqlalchemy import select

async with db_session() as session:  # db_session() here must yield an AsyncSession
    result = await session.stream(select(Thing.id))
    async for partition in result.partitions(1000):
        for item in partition:
            analyze(item)

This creates a forward cursor that fetches the result in batches of 1000 rows, which results in minimal memory usage on the server and on the client.
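
With a plain synchronous Session the pattern is roughly the following sketch (assuming SQLAlchemy 1.4.40+ or 2.0, where yield_per is accepted as an execution option; db_session here stands for an ordinary sync session context manager):

from sqlalchemy import select

with db_session() as session:
    result = session.execute(select(Thing.id).execution_options(yield_per=1000))
    for partition in result.partitions():  # batches of the yield_per size
        for item in partition:
            analyze(item)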

auxsvr
  • does this work with the normal session object used by the sqlalchemy ORM? I'm getting `'Session' object has no attribute 'stream'` – Matt Nov 15 '22 at 19:07
  • `session = sessionmaker(class_=AsyncSession, future=True, ...)()` here. I'm using Sqlalchemy master, 1.4 should support this too. – auxsvr Nov 16 '22 at 06:26