6

The default SQLAlchemy behavior for compiling in_ expressions is pathological for very large lists, and I want to create a custom, faster, compiler for the operator. It doesn't matter to the application if the solution is a new operator (i.e.: in_list_) or if it overrides the default compiler for in_. However, I haven't been able to find any documentation on how to do this specifically.

The subclassing guidelines for compilation extension don't include anything about operators, suggesting that is not the place to start. The documentation on redefining and creating new operators is focused on changing or creating new operator behavior, but the behavior of the operator is not the problem, just the compiler.

Here is a very non-working example of what I'm trying to accomplish:

from sqlalchemy.ext.compiler import compiles
from sqlalchemy.types import TypeEngine

class in_list_(TypeEngine.Comparator):
    pass

@compiles(in_list_)
def in_list_impl(element, compiler, **kwargs):
    return "IN ('Now', 'I', 'can', 'inline', 'the', 'list')"

And then in an expression:

select([mytable.c.x, mytable.c.y]).where(mytable.c.x.in_list_(long_list))
wst
  • Recent similar question: https://stackoverflow.com/q/57427283/6560549 – SuperShoot Sep 07 '19 at 02:40
  • Related https://stackoverflow.com/questions/56761442/sqlalchemy-set-membership-for-very-large-sets. – Ilja Everilä Sep 07 '19 at 06:58
  • what's the actual problem you wish to solve? the IN operator is being completely reworked in 1.4 and you can use the new functionality right now with [expanding IN parameters](https://docs.sqlalchemy.org/en/13/orm/extensions/baked.html#baked-in) – zzzeek Sep 07 '19 at 15:39
  • in 1.4, you can combine "expanding" with "literal_execute" as well and get exactly what you're looking at above. so why not instead help with [development](https://www.sqlalchemy.org/develop.html)? – zzzeek Sep 07 '19 at 15:40
  • 1
    see? https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/testing/suite/test_select.py#L509 – zzzeek Sep 07 '19 at 15:42
  • 1
    the plan is that "expanding" will be automatic for all IN expressions – zzzeek Sep 07 '19 at 15:43
  • Hi @zzzeek - The problem was that for large lists, the creation of bind parameters became expensive, in many cases nearly half the total execution time. But yes, thank you! I failed to test `bindparam` with the list, which successfully eliminated the overhead of binding every list item. – wst Sep 07 '19 at 19:39
  • 3
    yes this is a known issue and 1.4 is hoped to solve the problem completely. – zzzeek Sep 08 '19 at 15:11

3 Answers

5

Using IN for very large lists is indeed pathological, and you might be better served by a temporary table and an IN against a subquery, or a join. But the question was "how to override compiler output for a specific operator". In the case of binary operators such as IN and NOT IN, what you need to override is how SQLAlchemy compiles BinaryExpression objects:

from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.elements import BinaryExpression
from sqlalchemy.sql.operators import in_op, notin_op

def visit_in_op_binary(compiler, binary, operator, **kw):
    return "%s IN %s" % (
        compiler.process(binary.left, **kw),
        compiler.process(binary.right, **{**kw, "literal_binds": True}))

def visit_notin_op_binary(compiler, binary, operator, **kw):
    return "%s NOT IN %s" % (
        compiler.process(binary.left, **kw),
        compiler.process(binary.right, **{**kw, "literal_binds": True}))

@compiles(BinaryExpression)
def compile_binary(binary, compiler, override_operator=None, **kw):
    operator = override_operator or binary.operator

    if operator is in_op:
        return visit_in_op_binary(
            compiler, binary, operator, override_operator=override_operator,
            **kw)

    if operator is notin_op:
        return visit_notin_op_binary(
            compiler, binary, operator, override_operator=override_operator,
            **kw)

    return compiler.visit_binary(binary, override_operator=override_operator, **kw)

Note that simply producing the binary expression containing the grouping and clause list of bind params takes a surprisingly long time for very large lists, not to mention compiling all of that even when using literal binds, so you might not observe significant performance gains. On the other hand, many implementations limit how many placeholders / parameters you can use in a statement, so inlining the binds allows such queries to run at all.
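The temporary-table alternative mentioned at the top can be sketched with plain DB-API calls. This uses stdlib sqlite3 purely for illustration, with made-up table and column names; the same shape works in Postgresql or MySQL with their temp-table syntax:

```python
import sqlite3

# Sketch of the temp-table alternative: load the big value list into a
# temporary table once, then JOIN against it instead of emitting a huge IN.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (x TEXT, y TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("Now", "a"), ("never", "b"), ("list", "c")],
)

long_list = ["Now", "I", "can", "inline", "the", "list"]
conn.execute("CREATE TEMP TABLE wanted (x TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(v,) for v in long_list])

rows = conn.execute(
    "SELECT m.x, m.y FROM mytable AS m JOIN wanted AS w ON m.x = w.x"
).fetchall()
# rows now contains only the mytable rows whose x appears in long_list
```

The inserts into the temp table are still bound one value at a time, but drivers can batch them efficiently, and the query itself stays tiny regardless of list size.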

If on the other hand your list does fit within the limits set by your implementation (Postgresql seems to be only limited by available RAM), you may not need any compiler workarounds with a recent enough SQLAlchemy; use expanding bind parameters instead:

In [15]: %%time
    ...: session.query(Foo).\
    ...:     filter(Foo.data.in_(range(250000))).\
    ...:     all()
    ...: 
CPU times: user 5.09 s, sys: 91.9 ms, total: 5.18 s
Wall time: 5.18 s
Out[15]: []

In [16]: %%time
    ...: session.query(Foo).\
    ...:     filter(Foo.data.in_(bindparam('xs', range(250000), expanding=True))).\
    ...:     all()
    ...: 
CPU times: user 310 ms, sys: 8.05 ms, total: 318 ms
Wall time: 317 ms
Out[16]: []

And as mentioned in comments, in version 1.4 an expanding bindparam will support literal execution out of the box:

In [4]: session.query(Foo).\
   ...:     filter(Foo.data.in_(
   ...:         bindparam('xs', range(10), expanding=True, literal_execute=True))).\
   ...:     all()
2019-09-07 20:35:04,560 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2019-09-07 20:35:04,561 INFO sqlalchemy.engine.base.Engine SELECT foo.id AS foo_id, foo.data AS foo_data 
FROM foo 
WHERE foo.data IN (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
2019-09-07 20:35:04,561 INFO sqlalchemy.engine.base.Engine ()
Out[4]: []
Ilja Everilä
  • Great answer to my question, even though as you and @zzzeek correctly point out, expanding bind parameters are a much simpler solution. Your temp table answer was also very helpful. After translating the solution to MySQL, I ran some tests, and the temp table-based query run time is about break-even with the `bindparam` solution with a list size of about 175k integers. But I suspect it will be necessary to use the temp table if that list needs to increase significantly. – wst Sep 07 '19 at 19:51
0

One thing you can do is to drop to raw SQL and build the query by hand. But, and this is crucial, YOU HAVE TO USE BINDS.

And once you commit to that, you need to manage the entire query's variables, not just the IN-list bits, so you're totally on your own. It's unrealistic to do this frequently unless you have a specialized, highly tested utility function, but it does work. And pretty quickly too: I have to chunk at 999 because Oracle doesn't go above that, but neither Postgresql nor Oracle has complained much otherwise. And, yes, this is under SQLAlchemy (1.3.8).
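As an aside, the chunk-at-999 workaround for Oracle's IN-list cap can be as simple as a helper like this (hypothetical name, not from the original code; you run one query per chunk and merge the results):

```python
def chunked(values, size=999):
    """Split values into lists of at most `size` items.

    Oracle caps IN lists at 1000 entries, hence the 999 default.
    """
    return [values[i:i + size] for i in range(0, len(values), size)]

chunks = chunked(list(range(2500)))
# three chunks: 999, 999, and 502 items
```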

Here's some sample code. The bits for the list were all generated dynamically, to target Postgresql. Unfortunately, each RDBMS has its own flavor of placeholder format and bind variable. Refer to PEP249 paramstyles for more details.

What the generated query looks like:
qry = """select recname, objecttype
            from bme_mvprd
            where ignore = false
YOU HAVE TO BUILD THIS 
            and objecttype 
in ( 
%(objecttypes_000__)s
, %(objecttypes_001__)s
, %(objecttypes_002__)s
, %(objecttypes_003__)s
)
...
"""

The criteria originally came from this array: [0, 1, 2, 4]

And what the bind parameters you pass to execute look like, again Postgres-specific (yes, you need to generate this too):

sub = {
    'objecttypes_000__': 0,
    'objecttypes_001__': 1,
    'objecttypes_002__': 2,
    'objecttypes_003__': 4,
}
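Generating that placeholder fragment and the matching dict can be automated with a small helper along these lines. `make_in_clause` is a hypothetical name, and the `%(name)s` placeholders assume psycopg2's pyformat paramstyle:

```python
def make_in_clause(prefix, values):
    """Build a pyformat 'in ( ... )' fragment plus its bind dict."""
    names = ["%s_%03d__" % (prefix, i) for i in range(len(values))]
    clause = "in ( " + "\n, ".join("%%(%s)s" % n for n in names) + "\n)"
    binds = dict(zip(names, values))
    return clause, binds

clause, binds = make_in_clause("objecttypes", [0, 1, 2, 4])
# binds == {'objecttypes_000__': 0, ..., 'objecttypes_003__': 4}
```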

AND you have to `execute(qry, sub)` to use binding.

NO CHEATING with `execute(qry % sub)`, which would work in Postgresql but would drop you right back into SQL Injection land.

Oracle uses `:1, :2, :3`-type placeholders, so that would error out there, but Postgresql works with Python-style placeholders, so you need to be super-careful you're not accidentally bypassing parameter binding.

Note: large IN SQL operations can sometimes be replaced with EXISTS tests, and those ought to be preferred if possible. My chunk-by-999 example above is because there's no other way short of first populating a temp table: it's an example, not best practice.

PPS: and what if your list is empty? I asked a question about that; the answer was `"... in (%(var001)s)..."` with `{'var001': None}`, but you could just not add the `IN (...)` clause at all.

JL Peyret
  • 1
    I'm not going to downvote this, but you absolutely do not need to use binds. In fact, the entire impetus for this question is that SQLAlchemy automatically generates params to bind when it builds `IN` expressions, and once the list size is grows into to the 10s of thousands, the time spent string building params starts to become significant wrt total runtime. Everything counts in large amounts. – wst Sep 07 '19 at 19:58
  • @wst When talking about performance, bring numbers. Using **mogrify** (see [stackoverflow.com/how-do-i-get-a-raw-compiled-sql-query-from-a-sqlalchemy-expression](https://stackoverflow.com/questions/4617291/how-do-i-get-a-raw-compiled-sql-query-from-a-sqlalchemy-expression)), I got an average time to a) parse to query + binds, then b) mogrify.execute. For 10K-item lists: 0.3 sec; 3.5 seconds on 100K-item lists. Not the same thing as actual database execution, but that bit would remain anyway. No idea how that compares to your numbers and will leave it at that. Hardware: MBP 2011 – JL Peyret Sep 08 '19 at 16:30
  • Numbers aren't relevant to the scope of the question. The question is asking how to extend part of SQLAlchemy. I intentionally avoided making it about performance because I already knew how to solve the performance problem. – wst Sep 09 '19 at 01:46
  • That’s an odd statement to make because you've been chatting up perf realities to me all this time. Whatever. – JL Peyret Sep 09 '19 at 01:52
-1

Full disclosure: I don't know how to override the compiler of in_(), and I know I'm at risk of oversimplifying, so please accept this in good faith, but given your example above I wouldn't even try. Rather, just create a helper function:

from sqlalchemy import Table, Column, String, MetaData
from sqlalchemy.sql import text


long_list = ['Now', 'I', 'can', 'inline', 'the', 'list']
tbl = Table("mytable", MetaData(), Column("x", String), Column("y", String))


def col_in_list(col, l):
    # do something safe to generate your in clause here.
    return text("IN ('Now', 'I', 'can', 'inline', 'the', 'list')")


if __name__ == "__main__":
    print(tbl.select().where(col_in_list(tbl.c.x, long_list)))

Renders:

SELECT mytable.x, mytable.y
FROM mytable
WHERE mytable.x IN ('Now','I','can','inline','the','list')
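If the clause does need to be built from the list at runtime, the body of `col_in_list` could still emit placeholders rather than literals. A sketch in DB-API qmark style, using stdlib sqlite3 for illustration (`in_clause_qmark` is a hypothetical helper; the column name is concatenated, so it must come from your code, never from user input):

```python
import sqlite3

def in_clause_qmark(column, values):
    """Return a parameterized 'col IN (?, ?, ...)' fragment and its params."""
    placeholders = ", ".join("?" for _ in values)
    return "%s IN (%s)" % (column, placeholders), list(values)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (x TEXT, y TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("Now", "1"), ("never", "2"), ("list", "3")],
)

clause, params = in_clause_qmark("x", ["Now", "I", "can", "inline", "the", "list"])
rows = conn.execute("SELECT x, y FROM mytable WHERE " + clause, params).fetchall()
# only the rows whose x is in the list come back, with the values bound
```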
SuperShoot
  • Sorry, I am going to do something I rarely do and downvote. Concatenating strings is the road to SQL Injection and a top-10 OWASP source of hacks. Just say No! And I don't care if it's "your variables". Having this in your code base, esp as a general approach, has a way to attract user-sourced vars, esp w junior coders. https://xkcd.com/327/ – JL Peyret Sep 07 '19 at 15:49
  • @JLPeyret It's not an injection attack surface if there's no path for an outsider to inject. Using the same logic, one could argue that many SQLAlchemy features are similarly "unsafe" operations, albeit with more ceremony. The simple fact is that any language abstraction layer over SQL will probably require concatenating strings. – wst Sep 07 '19 at 18:07
  • @wst I am going to respectfully, but fundamentally agree to disagree with you there. No one sets out to expose their code to outside attack, but using insecure code, even in your internals, facilitates that. String concatenation is necessary to accommodate SQL keywords and schema objects. And only that. Past that point, use binds. Extra bonus: handling ‘None’ values with string manipulations is **hard** and binds do it very well. – JL Peyret Sep 07 '19 at 18:42
  • @wst edit, as I realized you were the OP. If you do it just this once, you ring-fence your code with big !!!!!!!!! DONT USE USER VARIABLES HERE !!!!!! and lock it away... maybe. Bad practice. But it's your code. As general advice? Horrendous. – JL Peyret Sep 07 '19 at 19:03
  • @JLPeyret Of course it's sensible to use the safer, simpler solution when performance doesn't matter, but sometimes it does, and that advice is too strict to be useful generally. Sometimes you need sharp tools, and sharp tools are sharp, so you just have to be careful. – wst Sep 07 '19 at 20:17
  • 1
    @JLPeyret I was trying to illustrate that I thought overriding the compilation of the IN operator was heavy handed so I've removed the dynamically generated string from the answer. – SuperShoot Sep 07 '19 at 21:18
  • @SuperShoot Sorry if I was a bit harsh, and did not mean to censor, but you really need to be careful when giving examples of query concatenation techniques. You can post them, of course, but just wrap it with a big fat *THIS IS INSECURE.* warning and maybe link to some SQL Injection explanation. Then that's a perfectly OK way to make your point. Thing is, people do copy/paste from Stackoverflow and what is obviously risky to us may not be to beginners. – JL Peyret Sep 08 '19 at 16:42