13

Any ideas how to speed up this query?

Input

EXPLAIN SELECT entityid FROM entity e
LEFT JOIN level1entity l1 ON l1.level1id = e.level1_level1id
LEFT JOIN level2entity l2 ON l2.level2id = l1.level2_level2id
WHERE l2.userid = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f'
AND entityid NOT IN
(1377776,1377792,1377793,1377794,1377795,1377796, ... 50000 ids)

Output

Nested Loop  (cost=0.00..1452373.79 rows=3865 width=8)
  ->  Nested Loop  (cost=0.00..8.58 rows=1 width=8)
        Join Filter: (l1.level2_level2id = l2.level2id)
        ->  Seq Scan on level2entity l2  (cost=0.00..3.17 rows=1 width=8)
              Filter: ((userid)::text = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f'::text)
        ->  Seq Scan on level1entity l1  (cost=0.00..4.07 rows=107 width=16)
  ->  Index Scan using fk_fk18edb1cfb2a41235_idx on entity e  (cost=0.00..1452086.09 rows=22329 width=16)
        Index Cond: (level1_level1id = l1.level1id)

OK, here is a simplified version; the joins aren't the bottleneck:

SELECT entityid FROM
(SELECT entityid FROM entity e LIMIT 5000) a
WHERE entityid NOT IN
(1377776,1377792,1377793,1377794,1377795, ... 50000 ids)

The problem is to find the entities which don't have any of these ids.

EXPLAIN

Subquery Scan on a  (cost=0.00..312667.76 rows=1 width=8)
  Filter: (e.entityid <> ALL ('{1377776,1377792,1377793,1377794, ... 50000 ids}'::bigint[]))
  ->  Limit  (cost=0.00..111.51 rows=5000 width=8)
        ->  Seq Scan on entity e  (cost=0.00..29015.26 rows=1301026 width=8)
Endre Both
wutzebaer
    **You need to show us the table and index definitions.** Diagnosing slow queries requires full table and index definitions, not just a description or paraphrase. Maybe your tables are defined poorly. Maybe the indexes aren't created correctly. Maybe you don't have an index on that column you thought you did. Without seeing the table and index definitions, we can't tell. If you know how to do an `EXPLAIN` or get an execution plan, put the results in the question as well. – Andy Lester Jul 23 '13 at 14:52
  • Chances are that that 50,000 IDs in that NOT IN clause what is forcing a full sequential scan. But we must see the table definitions to know. – Andy Lester Jul 23 '13 at 14:53
  • Is there an index on enitity(enitityid)? Is it the PK? Did you `analyze` the table? – Clodoaldo Neto Jul 23 '13 at 15:21
  • yes its the primary key – wutzebaer Jul 23 '13 at 15:23

4 Answers

39

A huge IN list is very inefficient. PostgreSQL should ideally identify it and turn it into a relation that it does an anti-join on, but at this point the query planner doesn't know how to do that, and the planning time required to identify this case would cost every query that uses NOT IN sensibly, so it'd have to be a very low cost check. See this earlier much more detailed answer on the topic.

As David Aldridge wrote this is best solved by turning it into an anti-join. I'd write it as a join over a VALUES list simply because PostgreSQL is extremely fast at parsing VALUES lists into relations, but the effect is the same:

SELECT entityid 
FROM entity e
LEFT JOIN level1entity l1 ON l1.level1id = e.level1_level1id
LEFT JOIN level2entity l2 ON l2.level2id = l1.level2_level2id
LEFT OUTER JOIN (
    VALUES
    (1377776),(1377792),(1377793),(1377794),(1377795),(1377796)
) ex(ex_entityid) ON (entityid = ex_entityid)
WHERE l2.userid = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f' 
AND ex_entityid IS NULL; 

For a sufficiently large set of values you might even be better off creating a temporary table, COPYing the values into it, creating a PRIMARY KEY on it, and joining on that.
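A minimal sketch of that temp-table variant (the table name, file path, and the restored join structure are illustrative; adapt them to your schema and client):

```sql
-- Hypothetical sketch: stage the 50,000 ids in a temp table and anti-join on it.
CREATE TEMPORARY TABLE excluded_ids (entityid bigint PRIMARY KEY);

-- Load the ids from a file, or stream them with COPY ... FROM STDIN from your client.
COPY excluded_ids (entityid) FROM '/path/to/ids.csv';

ANALYZE excluded_ids;  -- give the planner row estimates for the join

SELECT e.entityid
FROM entity e
LEFT JOIN level1entity l1 ON l1.level1id = e.level1_level1id
LEFT JOIN level2entity l2 ON l2.level2id = l1.level2_level2id
LEFT JOIN excluded_ids x ON x.entityid = e.entityid
WHERE l2.userid = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f'
  AND x.entityid IS NULL;
```

With a primary key on the staging table and fresh statistics, the planner can pick a hash or merge anti-join instead of testing each row against a 50,000-element array.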

More possibilities explored here:

https://stackoverflow.com/a/17038097/398670

Craig Ringer
  • Is there a way to do the same for `inet` column? We have a list of IPv4 address that we want to exclude in the query. `explain analyze SELECT d.* FROM ip_table d LEFT OUTER JOIN ( VALUES ('0.0.0.0'), ('127.0.0.1'), ('10.0.0.1'), ('255.255.255.255'),('::') ) ex(ex_entityid) ON (ip_addr in (ex_entityid)) where d.col2 = 27;` – CodeRain May 03 '20 at 12:36
  • 2
    So very helpful! I took a query with a `NOT IN` and turned it into an anti-JOIN. Sped it up by a factor of 20 or so. – user1071847 Sep 17 '20 at 14:10
  • Glad. For other readers, be aware that this is 7 years old. Always check old answers vs current code and behaviour. I am not aware of any optimisations relating to this, but you should check. – Craig Ringer Sep 27 '20 at 04:30
7

You might get a better result if you can rewrite the query to use a hash anti-join.

Something like:

with exclude_list as (
  select unnest(string_to_array('1377776,1377792,1377793,1377794,1377795, ...', ','))::integer as entity_id
)
select entity_id
from   entity
left join exclude_list on entity.entity_id = exclude_list.entity_id
where  exclude_list.entity_id is null;
David Aldridge
  • I'd personally use a `VALUES` list, since that produces a relation extremely efficiently and directly, or at least `unnest` an `ARRAY[]` constructor, but either way building a relation and doing an anti-join on it is certainly the right way. – Craig Ringer Jul 24 '13 at 03:40
2

OK, my solution was:

  • select all entities
  • left join all entities which have one of the ids (without the NOT it is faster) on the entityid
  • select all rows where the joined select is NULL

as explained in

http://blog.hagander.net/archives/66-Speeding-up-NOT-IN.html
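The steps above can be sketched as a single query (a hypothetical sketch assuming the schema from the question; the id list is truncated as in the original):

```sql
-- Anti-join sketch: find entities whose id matches NONE of the listed ids.
SELECT e.entityid
FROM entity e
LEFT JOIN (
    -- entities that DO have one of the ids (plain IN, no NOT)
    SELECT entityid
    FROM entity
    WHERE entityid IN (1377776, 1377792, 1377793 /* ... 50000 ids */)
) hit ON hit.entityid = e.entityid
WHERE hit.entityid IS NULL;
```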

wutzebaer
0

Since your WHERE clause checks for a specific userid (`l2.userid = ...`), a matching level2entity record is required anyway, so you should turn your `LEFT JOIN level2entity` into an `INNER JOIN level2entity`:

INNER JOIN level2entity l2 ON l2.level2id = l1.level2_level2id AND l2.userid = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f'

This will, hopefully, filter down your entities so your NOT IN has less work to do.
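Applied to the query from the question, the rewrite might look like this (a sketch; the id list is truncated as in the original):

```sql
SELECT entityid
FROM entity e
INNER JOIN level1entity l1 ON l1.level1id = e.level1_level1id
INNER JOIN level2entity l2 ON l2.level2id = l1.level2_level2id
      AND l2.userid = 'a987c246-65e5-48f6-9d2d-a7bcb6284c8f'
WHERE entityid NOT IN (1377776, 1377792 /* ... 50000 ids */);
```

Note that filtering on `l2.userid` in the WHERE clause already made the outer joins behave like inner joins, so this mostly makes the intent explicit rather than changing the result.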

Louis Ricci