
Let's say I have a set of items:

  • Item1
  • Item2
  • Item3
  • Item4
  • Item5

A query that excludes these items can be constructed in two ways. Firstly:

SELECT * 
FROM TABLE 
WHERE ITEM NOT IN ('item1', 'item2', 'item3', 'item4', 'item5')

Or, it can be written as:

SELECT * 
FROM TABLE 
WHERE ITEM != 'item1' 
  AND ITEM != 'item2' 
  AND ITEM != 'item3' 
  AND ITEM != 'item4' 
  AND ITEM != 'item5'

  1. Which is more efficient and why?
  2. At what point does one become more efficient than the other? In other words, what if there were 500 items?

My question is specifically relating to PostgreSQL.

rockstardev
    Little nitpick: the standard SQL operator for "not equals" is `<>`, although all (?) DBMSs seem to support the non-standard `!=` just as well. – Jun 11 '13 at 06:16
    When you say "more efficient", do you mean "faster"? "Efficient" can refer to many things other than just execution speed. – Andy Lester Jun 12 '13 at 04:23
    Efficient can refer to execution time and resource use. Pretty much two things. – grantwparks Aug 02 '18 at 20:01

2 Answers


In PostgreSQL there's usually a fairly small difference at reasonable list lengths, though IN is much cleaner conceptually. Very long AND ... <> ... lists and very long NOT IN lists both perform terribly, with AND much worse than NOT IN.

In both cases, if the lists are long enough that you're even asking the question, you should be doing an anti-join or subquery exclusion test over a value list instead.

WITH excluded(item) AS (
    VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
)
SELECT * 
FROM thetable t
WHERE NOT EXISTS(SELECT 1 FROM excluded e WHERE t.item = e.item);

or:

WITH excluded(item) AS (
    VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
)
SELECT * 
FROM thetable t
LEFT OUTER JOIN excluded e ON (t.item = e.item)
WHERE e.item IS NULL;

(On modern Pg versions both will produce the same query plan anyway.)

If the value list is long enough (many tens of thousands of items) then query parsing may start having a significant cost. At this point you should consider creating a TEMPORARY table, COPYing the data to exclude into it, possibly creating an index on it, then using one of the above approaches on the temp table instead of the CTE.
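As a rough sketch of that approach (the file path here is hypothetical, and thetable/item are the names from the examples above):

CREATE TEMPORARY TABLE excluded(item text);

-- COPY reads a server-side file; from psql, use \copy for a client-side file.
COPY excluded(item) FROM '/tmp/excluded_items.txt';

-- Optional; for the data set size in the demo below it made no difference.
CREATE INDEX ON excluded(item);
ANALYZE excluded;

SELECT *
FROM thetable t
WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE t.item = e.item);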

Demo:

CREATE UNLOGGED TABLE exclude_test(id integer PRIMARY KEY);
INSERT INTO exclude_test(id) SELECT generate_series(1,50000);                 -- 50,000 rows to scan
CREATE TABLE exclude AS SELECT x AS item FROM generate_series(1,40000,4) x;  -- 10,000 values to omit

where exclude is the list of values to omit.

I then compare the following approaches on the same data with all results in milliseconds:

  • NOT IN list: 3424.596
  • AND ... list: 80173.823
  • VALUES based JOIN exclusion: 20.727
  • VALUES based subquery exclusion: 20.495
  • Table-based JOIN, no index on ex-list: 25.183
  • Subquery table based, no index on ex-list: 23.985

... making the CTE-based approach over three thousand times faster than the AND list and about 165 times faster than the NOT IN list.

Code here: https://gist.github.com/ringerc/5755247 (shield your eyes, ye who follow this link).

For this data set size adding an index on the exclusion list made no difference.

Notes:

  • IN list generated with SELECT 'IN (' || string_agg(item::text, ',' ORDER BY item) || ')' FROM exclude;
  • AND list generated with SELECT string_agg(item::text, ' AND item <> ') FROM exclude;
  • Subquery and join based table exclusion were much the same across repeated runs.
  • Examination of the plan shows that Pg translates NOT IN to <> ALL (see the EXPLAIN sketch below).
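
To check that translation yourself, a quick sketch against the exclude_test table from the demo above (exact plan text varies by version):

EXPLAIN SELECT * FROM exclude_test WHERE id NOT IN (1, 2, 3);
-- The plan's filter line shows the rewritten form, roughly:
--   Filter: (id <> ALL ('{1,2,3}'::integer[]))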

So... you can see that there's a truly huge gap between both IN and AND lists vs doing a proper join. What surprised me was how fast doing it with a CTE using a VALUES list was ... parsing the VALUES list took almost no time at all, performing the same or slightly faster than the table approach in most tests.

It'd be nice if PostgreSQL could automatically recognise a preposterously long IN clause or chain of similar AND conditions and switch to a smarter approach like doing a hashed join or implicitly turning it into a CTE node. Right now it doesn't know how to do that.


Craig Ringer
  • There is no specific limitation for PostgreSQL, but some databases have a limit on how big the `IN` operator can get, which gives +1 to the `AND ... <> ...` construct. – Burhan Khalid Jun 11 '13 at 06:58
    @BurhanKhalid Using chained `AND ... <> ...` also makes life hard on the parser and planner. There were recent mailing list reports of query planning taking *minutes* for queries with tens of thousands of such clauses, as generated by some awful ORMs. – Craig Ringer Jun 11 '13 at 06:58
    Damn ... _*minutes*_ for _planning_?! – Burhan Khalid Jun 11 '13 at 07:09
    @BurhanKhalid The query planner would have to coalesce huge `AND` inequality lists into a `NOT IN` list to get a faintly sane result, which itself would take time, and make planning slower for the vast majority of queries that are not insane. This is one of those "don't do this" things ... if you're generating a query with *50,000 simple `OR` clauses* like one I saw recently, you're *doing it wrong*. Minutes of planning time is terrible, but so is making the common case much slower to deal with bizarre corner cases. – Craig Ringer Jun 11 '13 at 08:00
  • [The `OUTER` keyword is optional](https://stackoverflow.com/q/406294/6064933), for anyone interested. Why not just drop `OUTER` and use `LEFT JOIN` instead? – jdhao Jun 03 '22 at 14:58
  • @jdhao Personal preference, I find it more readable and explicit. – Craig Ringer Jun 10 '22 at 02:13

I disagree somewhat with the original accepted answer by @Jayram.

Not least, the link is for SQL Server and contradicts many other articles and answers. Also, there are no indexes on the sample table.

Normally, for subquery SQL constructs:

  • <> (or !=) is a scalar comparison
  • NOT IN is a left anti-semi-join relational operator

In simpler terms:

  • NOT IN becomes a form of JOIN that can use an index (except PostgreSQL!)
  • != is often non-SARGable and an index may not be used

This was discussed on dba.se: "The use of NOT logic in relation to indexes". For PostgreSQL, this explainextended article explains the internals further (though not, unfortunately, for a list of constants with NOT IN).

Either way, for a list of constants, I'd generally use NOT IN over <> because it's easier to read and because of what @CraigRinger explained.

For a subquery, NOT EXISTS is the way to go.
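
For instance, a minimal sketch of that form (orders and cancellations are hypothetical names, not from the question):

SELECT o.*
FROM orders o
WHERE NOT EXISTS (
    SELECT 1
    FROM cancellations c
    WHERE c.order_id = o.order_id
);

As @CraigRinger notes in the comments, PostgreSQL plans this as an anti-join.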

gbn
  • Neither of these are quite right for PostgreSQL; it can sometimes use an index for `<>` where table statistics support the theory that the excluded value is overwhelmingly common, and AFAIK it can't use an index for a `NOT IN` list, which it internally translates to `id <> ALL`. – Craig Ringer Jun 11 '13 at 08:07
  • My 2nd link says it can't, unlike other RDBMSs. Anyhow, I'd use NOT EXISTS. – gbn Jun 11 '13 at 08:09
    Totally agree; `NOT EXISTS` or an anti-join on the data list is the sane way. PostgreSQL turns `not exists` into an anti-join anyway. – Craig Ringer Jun 11 '13 at 08:10
  • You might want to change your first sentence now that the accepted answer was changed ;) –  Jun 11 '13 at 14:42