1

I have written a SQL query that seems very simple, tableA with primary key x, and ytime is a timestamp field.

select a.x, b.z from 
    # tableA join tableB etc.
where a.id not in (select x from tableA where a.ytime is not null)  

There is another stackflow similar to this one but that one only talks about smaller subset of data.

Why SQL "NOT IN" is so slow?

Not sure if I need to index ytime column.

Alchemist
  • 849
  • 2
  • 10
  • 27
  • 1
    Please post `explain (analyze, buffers)`. Also, when you use subquery for your `IN` condition then postgres actually joins those tables instead. – Łukasz Kamiński Nov 30 '17 at 11:29

1 Answers1

2

I have no science to back this up -- only antecdotal history, but for cases where the "in" list is significantly large, I have found that the semi-join (or anti-join in this case) is typically much more efficient than the in list:

select a1.x, b.z
from 
  tableA a1
  join tableB b on ...
where not exists (
  select null
  from tablea a2
  where a1.id = a2.x and a2.ytime is not null
)

When I say significant, a query than ran for minutes using the in-list ran in just seconds after I changed it to the semi.

Try it and see if it makes a difference.

I made some assumptions since your code was largely notional, but I think you get the idea.

Hambone
  • 15,600
  • 8
  • 46
  • 69