Using IN
for very large lists is indeed pathological, and you might be better served using a temporary table and IN
against a subquery or a join. But the question was "how to override compiler output for specific operator". In case of binary operators such as IN
and NOT IN
what you need to override is how SQLAlchemy handles compiling BinaryExpression
s:
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.elements import BinaryExpression
from sqlalchemy.sql.operators import in_op, notin_op
def visit_in_op_binary(compiler, binary, operator, **kw):
return "%s IN %s" % (
compiler.process(binary.left, **kw),
compiler.process(binary.right, **{**kw, "literal_binds": True}))
def visit_notin_op_binary(compiler, binary, operator, **kw):
return "%s NOT IN %s" % (
compiler.process(binary.left, **kw),
compiler.process(binary.right, **{**kw, "literal_binds": True}))
@compiles(BinaryExpression)
def compile_binary(binary, compiler, override_operator=None, **kw):
operator = override_operator or binary.operator
if operator is in_op:
return visit_in_op_binary(
compiler, binary, operator, override_operator=override_operator,
**kw)
if operator is notin_op:
return visit_notin_op_binary(
compiler, binary, operator, override_operator=override_operator,
**kw)
return compiler.visit_binary(binary, override_operator=override_operator, **kw)
Note that simply producing the binary expression containing the grouping and clause list of bind params takes a surprisingly lot of time for very large lists, not to mention compiling all that even if using literal binds, so you might not observe significant performance gains. On the other hand many implementations have limits on how many placeholders / parameters you can use in a statement, and so inlining the binds allows such queries to run at all.
If on the other hand your list does fit within the limits set by your implementation (Postgresql seems to be only limited by available RAM), you may not need any compiler workarounds with a recent enough SQLAlchemy; use expanding bind parameters instead:
In [15]: %%time
...: session.query(Foo).\
...: filter(Foo.data.in_(range(250000))).\
...: all()
...:
CPU times: user 5.09 s, sys: 91.9 ms, total: 5.18 s
Wall time: 5.18 s
Out[15]: []
In [16]: %%time
...: session.query(Foo).\
...: filter(Foo.data.in_(bindparam('xs', range(250000), expanding=True))).\
...: all()
...:
CPU times: user 310 ms, sys: 8.05 ms, total: 318 ms
Wall time: 317 ms
Out[16]: []
And as mentioned in comments, in version 1.4 an expanding bindparam
will support literal execution out of the box:
In [4]: session.query(Foo).\
...: filter(Foo.data.in_(
...: bindparam('xs', range(10), expanding=True, literal_execute=True))).\
...: all()
2019-09-07 20:35:04,560 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2019-09-07 20:35:04,561 INFO sqlalchemy.engine.base.Engine SELECT foo.id AS foo_id, foo.data AS foo_data
FROM foo
WHERE foo.data IN (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
2019-09-07 20:35:04,561 INFO sqlalchemy.engine.base.Engine ()
Out[4]: []