0

There are 2 tables in my system: journal and journal_contribution. Both have a field uuid as primary key and the journal_contribution contains the field journal_uuid to link it to a journal. The other fields are not of importance. The journal table has 461283 records and the journal_contribution 336136 records.

I want to determine how many journals I have that do not have a journal_contribution referring to them.

My first idea was the following query

select count(*)
from journal 
where uuid not in 
(select journal_uuid as uuid
from journal_contribution)

I know this is not very performant but I was surprised that after 2 minutes I got 0 as result. Even more so that if I run the same query without the 'not' I do get as result 124121. If I subtract this from the total number of journals I would have expected the original result to be 337162.

When I change the code to the following more performant version I do get the right result but I want to understand the 0 in the first place. Can somebody explain this to me?

select count(*)
from journal 
where not exists
(select 1  
from journal_contribution jc
where jc.journal_uuid = journal.uuid)

These queries were done in pgAdmin 4.21 running PostgreSQL 11

od107
  • 67
  • 6

3 Answers3

5

I strongly, strongly discourage you from using not in with subqueries. The reason is simple: NULL values are not handled intuitively.

As you have observed, NOT IN filters out all rows if any of the rows returned by the subquery are NULL.

There is a simple alternative: not exists:

select count(*)
from journal j
where not exists (select 1
                  from journal_contribution jc
                  where jc.journal_uuid = j.uuid
                 );

In addition, this query can make use of an index on journal_contribution(journal_uuid). I am guessing that with such an index, this query will be rather speedy.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
2

Apparently the subquery contained NULL values.

Changing the query to

select count(uuid)
from journal 
where uuid not in 
(select distinct journal_uuid as uuid
from journal_contribution
 where journal_uuid is not null)

gives the correct result

The reason why this gives the result it gives is answered in SQL select with "IN" subquery returns no records if the sub-query contains NULL

The EXISTS operator has apparently no problem with NULL values

od107
  • 67
  • 6
2

The problem is the column "journal_uuid" is nullable in "journal_contribution" table. When trying to use the IN operator, NULL value will always return nothing. Just like to writing something like WHERE id = NULL will return nothing. The other query is working because you're trying to compare uuid values from the "journal" table and if I'm not wrong it's a primary key and not nullable

  • Well the values compared are the same. It seems EXISTS does not have an issue with NULL values – od107 Feb 19 '21 at 11:07
  • in the first query, the sub query will return a set of values, one or more of the values will be null because journal_contribution.journal_uuid is nullable so the query will look something like ``` select count(*) from journal where uuid not in (null, ...values) ``` which will always return nothing because the IN operator will compare each value with the "=" sign just like ```id = NULL``` will not work. The second query works that way: the sub query will try to find a all rows wth the same uuid and if so the exists function will return true or false. – Bini Barazany Feb 19 '21 at 11:50