Performance selecting rows not matching entry in another table when NULL is present

Question

Related question: How to select rows with no matching entry in another table?

I was trying to select rows using this method and couldn't get it to work in SQLite. After a bit of wrangling it occurred to me that the reason might be that there are NULL values in some fields. Sure enough, I was right, and when I changed the = to IS in the query below things started behaving as expected:

CREATE TEMP TABLE newEvent(id INTEGER,t INTEGER,name,extra,extra2,extra3);
INSERT INTO newEvent(id,t,name,extra,extra2,extra3) VALUES
                           (0, 1376351146, 'TEST', NULL, NULL, NULL),
                           (0, 1376348867, 'OLD', NULL, NULL,NULL);
SELECT n.id,n.t,n.name,n.extra,n.extra2,n.extra3 FROM newEvent n 
       LEFT JOIN event E ON n.t = E.t AND n.name IS E.name
                                      AND n.extra IS E.extra;
                                      AND n.extra2 IS E.extra2;
                                      AND n.extra3 IS E.extra3
       WHERE E.id IS NULL;
DROP TABLE newEvent;

In the above example, there is an existing record in table event with name='OLD'. The newEvent table is defined the same as the original event table.

However, I noticed a BIG problem: my query was now taking almost 30 seconds to run! If I change only the n.name IS E.name to n.name = E.name but leave all the other ISes as-is, then the query only takes around 400ms. (There are around 3 million records in table event.)

Why the big difference in performance? It turns out I can actually use = instead of IS for the name comparison because it is never null, but if it ever were to be NULL it seems like this would break. Conversely, I am concerned that at some point the query might start running slow, since I don't understand what it is about name that makes the equality query run so much faster. My guess is that maybe SQLite somehow knows that there are nulls in the extra fields and is able to optimize but I would like something a bit more firm than a wild guess.

As far as I can tell, IS is simply = with the additional provisio that it will treat NULL comparisions the same as if they were empty strings (assuming there are no actual empty strings to compare). So why is using = on the name field 75 times faster, but has no effect on performance on the extra fields???

Why are you not comparing the `id` column? Can it be different for rows that you want to be matching? — CL., Aug 13 '13 at 07:11
@Andriy name and the extra fields are all untyped (types are more "suggestions" to SQLite anyway). Name is always a string, while extras are both string, number, and NULL. It does make a difference, as SQLite won't match "200" as equal to 200, for example. — Michael, Aug 13 '13 at 14:26
@CL I'm not matching the row because it is automatic and I don't know it until the insert occurs. I'm basically trying to prevent the same record from being inserted twice (it is possible in rare circumstances in the architecture I'm using). The other fields are sufficient to uniquely identify a record. — Michael, Aug 13 '13 at 14:27
Thank you, didn't know about the untyped columns thing, that's a curious aspect of SQLite. (And removing my particularly noisy first comment.) — Andriy M, Aug 14 '13 at 04:38

score 1 · Accepted Answer · answered Aug 13 '13 at 07:21

In a join, SQLite can optimize = with index lookups, but not IS. Furthermore, it is not possible to use more than one index per table in a single query.

So either you do not have a multi-column index that includes both name and extra*, or the selectivity of the additional column(s) is not high enough to matter.

You could try a completely different query using a compound SELECT:

SELECT t, name, extra, extra2, extra3 FROM newEvent
EXCEPT
SELECT t, name, extra, extra2, extra3 FROM event

However, this does not allow you to get a column that is not compard (like your id).

Oddly enough, there is no index on the name or extra fields... there is one on the t field, however. — Michael, Aug 13 '13 at 15:01

Performance selecting rows not matching entry in another table when NULL is present

1 Answers1