Is there a way to make an SQL NOT IN query faster?

Question

I want to get the number of unique mobile phone entries per day that have been logged to a database and have never appeared in the log. I thought it was a trivial query but shock when the query took 10 minutes on a table with about 900K entries. A sample Select is getting the number of unique mobile phones that were logged on the 9th of April 2015 and had never been logged before. Its like getting who are the truly new visitors to you site on a specific day. SQL Fiddle Link

SELECT COUNT(DISTINCT mobile_number)
FROM log_entries
WHERE created_at BETWEEN '2015-04-09 00:00:00'
    AND '2015-04-09 23:59:59'
    AND mobile_number NOT IN (
        SELECT mobile_number
        FROM log_entries
        WHERE created_at < '2015-04-09 00:00:00'
        )

I have individual indexes on created_at and on mobile_number.

Is there a way to make it faster? I see a very similar question here on SO but that was working with two tables.

side note : `WHERE created_at betwen '2015-04-09 00:00:00'and '2015-04-09 23:59:59'` instead of `WHERE created_at >= '2015-04-09 00:00:00' AND created_at <= '2015-04-09 23:59:59'` — Vivek S., Apr 10 '15 at 06:54
@unique_id I think your only edit was the `between` syntax which I agree is more elegant. However, the query is running and its on its 7th minute and haven't got a result so don't think that solves the problem — lukik, Apr 10 '15 at 07:11
@lukik I've added `SELECT DISTINCT mobile_number` instead of `SELECT mobile_number` inside `NOT IN` also.. — Vivek S., Apr 10 '15 at 07:14
how much time this `SELECT mobile_number FROM log_entries WHERE created_at < '2015-04-09 00:00:00'` takes ? — Vivek S., Apr 10 '15 at 07:17
@unique_id: the `distinct` in the sub-query is unnecessary and will in fact make the query slower. — , Apr 10 '15 at 07:18
Does the `not exist` improve the performance? http://sqlfiddle.com/#!15/9ee8e/24 — , Apr 10 '15 at 07:20
@a_horse_with_no_name what? returned results in less than 0.5ms. Am now even doubting if its correct :-) Maybe you can put your solution in the answers for others to vet.. — lukik, Apr 10 '15 at 07:28
Could you show us the results (for your current and new query) of EXPLAIN ANALYZE? Please use http://explain.depesz.com to post the results. — Frank Heikens, Apr 10 '15 at 08:16
It *really* helps if you provide sample schema and data for questions like this. Ideally via http://sqlfiddle.com — Craig Ringer, Apr 10 '15 at 08:34
@FrankHeikens I've added the explain links for the query I provided in my question on the link http://explain.depesz.com/s/eup9 and for the solution provided by `a_horse_with_no_name` on the link http://explain.depesz.com/s/B64O pardon me if they are not correct. its the first time I've used it — lukik, Apr 10 '15 at 08:48
@lukik: Could you use EXPLAIN ANALYZE? Without ANALYZE the query isn't executed and there is no timing available. Without timing, you can't see any improvement, just a different plan. And "different" doesn't mean "better" — Frank Heikens, Apr 10 '15 at 09:39
I wonder if there is a case here for a redesign, such that the first entry of a mobile number into the log table is itself logged separately in the table that holds mobile phone numbers, or another one just for this purpose. — David Aldridge, Apr 10 '15 at 10:51
@lukik: would be nice if you use `explain (analyze, verbose)` instead of `explain` to generate the execution plan so that the actual run times of each step are visible. — , Apr 10 '15 at 10:53
Here is the `Explain, Analyze` for the @a_horse_with_no_name SQL. http://explain.depesz.com/s/tnWG for the previous one posted in this question, its still running 45 minutes in! — lukik, Apr 10 '15 at 11:08
Aah, its just finished! Here is the `Explain Analyze` for the query in the question http://explain.depesz.com/s/3Bb6 — lukik, Apr 10 '15 at 11:11
Thanks. Could you add the plan for Craig's solution as well (please don't check the "anonymized" option) — , Apr 10 '15 at 11:20
Ok. And here they are not anonymized: From CraigRinger: http://explain.depesz.com/s/NC3 From a_horse_with_no_name: http://explain.depesz.com/s/DcZw My Original Query: http://explain.depesz.com/s/WNE2 Please explain the findings — lukik, Apr 10 '15 at 11:29

score 4 · Accepted Answer · 2015-04-10T07:38:13.293

A NOT IN can be rewritten as a NOT EXISTS query which is very often faster (unfortunately the Postgres optimizer isn't smart enough to detect this).

SELECT COUNT(DISTINCT l1.mobile_number) 
FROM log_entries as l1
WHERE l1.created_at >= '2015-04-09 00:00:00' 
  AND l1.created_at <= '2015-04-09 23:59:59' 
  AND NOT EXISTS (SELECT * 
                  FROM log_entries l2
                  WHERE l2.created_at < '2015-04-09 00:00:00'
                    AND l2.mobile_number = l1.mobile_number);

An index on (mobile_number, created_at) should further improve the performance.

A side note: created_at <= '2015-04-09 23:59:59' will not include rows with fractional seconds, e.g. 2015-04-09 23:59:59.789. When dealing with timestamps it's better to use a "lower than" with the "next day" instead of a "lower or equal" with the day in question.

So better use: created_at < '2015-04-10 00:00:00' instead to also "catch" rows on that day with fractional seconds.

I like this one. NOT EXISTS is semantically correct, and the anti-semijoin is very fast. — David Aldridge, Apr 10 '15 at 10:44

Craig Ringer · Answer 2 · 2015-04-10T08:31:45.037

1

I tend to suggest transforming NOT IN into a left anti-join (i.e. a left join that only keeps the left rows that do not match the right side). It's complicated somewhat in this case by the fact that it's a self join against two distinct ranges of the same table, so you're really joining two subqueries:

SELECT COUNT(n.mobile_number)
FROM (
  SELECT DISTINCT mobile_number
  FROM log_entries
  WHERE created_at BETWEEN '2015-04-09 00:00:00' AND '2015-04-09 23:59:59'
) n
LEFT OUTER JOIN (
  SELECT DISTINCT mobile_number
  FROM log_entries
  WHERE created_at < '2015-04-09 00:00:00'
) o ON (n.mobile_number = o.mobile_number)
WHERE o.mobile_number IS NULL;

I'd be interested in the performance of this as compared with the typical NOT EXISTS formulation provided by @a_horse_with_no_name.

Note that I've also pushed the DISTINCT check down into the subquery.

Your query seems to be "how many newly seen mobile numbers are there in <time range>". Right?

edited Apr 10 '15 at 08:31

answered Apr 10 '15 at 07:42

Craig Ringer

307,061
76
688
778

Getting an error `[Err] ERROR: missing FROM-clause entry for table "l" LINE 5: WHERE l.created_at BETWEEN '2015-04-09 00:00:00' AND '2015...` – lukik Apr 10 '15 at 07:56
@lukik Oops. Editing error. Fixing. Fixed. (This is why having a http://sqlfiddle.com/ is nice in questions) – Craig Ringer Apr 10 '15 at 08:31
Took 8.63 seconds. Am adding an `EXPLAIN` link as requested in the comments above you can see what's going on – lukik Apr 10 '15 at 08:35
@lukik Thanks. Would also be worth removing the `distinct`s on the inner queries and restoring to the count. – Craig Ringer Apr 10 '15 at 08:39
1

I'd expect that if there's a weakness in this method it lies in the potentially large number of values to be accessed in the second subquery, which has to scan a log history (granted it might just be an index on it). Accessing a few thousand rows in the first subquery is not so bad, but there may be millions/billions in the second. If so, and if _relatively_ few records are returned from the first subquery then I'd expect the NOT EXISTS formulation to work better. Very hard to tell without cardinality info from the real data. – David Aldridge Apr 10 '15 at 10:48
@DavidAldridge Yes, I tend to agree - if there's an appropriate index in place, the `not exists` subquery will probably work better. I was curious, as they often actually land up producing similar plans, but the `distinct` here throws a bit of a wrinkle in things. – Craig Ringer Apr 10 '15 at 11:06

score 0 · Answer 3 · answered Apr 10 '15 at 06:57

0

Isn't WHERE created_at >= '2015-04-09 00:00:00' AND created_at <= '2015-04-09 23:59:59' taking care of WHERE created_at < '2015-04-09 00:00:00'? Am I missing something here?

answered Apr 10 '15 at 06:57

Phoenix

335
2
9

I've edited the question for clarity. Please see if that answers it. Am trying to get mobile phone numbers that have been logged in a specific date that have never appeared before in that log – lukik Apr 10 '15 at 07:01
I'm sure about the performance, but you could try this: `SELECT COUNT(DISTINCT mobile_number) FROM log_entries x WHERE created_at >= '2015-04-09 00:00:00' AND created_at <= '2015-04-09 23:59:59' AND (SELECT count(mobile_number) FROM log_entries y WHERE created_at < '2015-04-09 00:00:00' and x.mobile_number = y.mobile_number) = 0;` – Phoenix Apr 10 '15 at 07:08

score 0 · Answer 4 · answered Apr 10 '15 at 07:01

0

NOT IN isn't fast at all. And your subquery returns a lot of repeating records. Maybe you should put unique numbers to dedicated table (because GROUP BY will be slow too).

answered Apr 10 '15 at 07:01

szarlih

135
9

score 0 · Answer 5 · answered Apr 10 '15 at 07:04

0

Try something like this:

SELECT mobile_number, min(created_at)
FROM log_entries
GROUP BY mobile_number
HAVING min(created_at) between '2015-04-09 00:00:00' and '2015-04-09 23:59:59'

Adding a single index covering both mobile_number and created_at will improve performance slightly, assuming that there are other columns in the table, as only that index will need to be scanned.

answered Apr 10 '15 at 07:04

TobyLL

2,098
2
17
23

How is your query taking care of NOT returning mobile phone numbers that were logged in the selected time but had been entered into the table before the 9th of April 2015 – lukik Apr 10 '15 at 07:09
If a mobile number was first seen on 2015-04-01, then min(created_at) will be that date, and the HAVING clause will exclude it. – TobyLL Apr 10 '15 at 07:12
This takes 8.5 seconds. Its an improvement but there are faster options given in the comments of the question – lukik Apr 10 '15 at 07:32

Wiol · Answer 6 · 2015-04-10T10:14:31.320

0

Try use WITH(if your sql support it). Here is help(postgres):http://www.postgresql.org/docs/current/static/queries-with.html

And your query should looks like that:

WITH  b as
(SELECT distinct mobile_number
        FROM log_entries
        WHERE created_at < '2015-04-09 00:00:00') 
SELECT COUNT(DISTINCT a.mobile_number)
FROM log_entries a   
left join b using(mobile_number)
where created_at >= '2015-04-09 00:00:00'
   AND created_at <= '2015-04-09 23:59:59' and b.mobile_number is null;

edited Apr 10 '15 at 10:14

answered Apr 10 '15 at 07:22

Wiol

11
2

you will need a `left join` not a `join` (which is an inner join) otherwise the condition `b.mobile_number is null` will never be true. – Apr 10 '15 at 07:26
Yes left join;) my mistake:) – Wiol Apr 10 '15 at 10:14
This time it ran and it took 10.26 ms. Check out the explain analyze for the other queries in the comments for the questions maybe that might help – lukik Apr 10 '15 at 11:34

Is there a way to make an SQL NOT IN query faster?

6 Answers6