0

I have two tables, one for profiles and one for the profile's employment status. The two tables have a one-to-one relationship. One profile might might not have an employment status. The table schemas are as below (irrelevant columns removed for clarity):

create type employment_status as enum ('claimed', 'approved', 'denied');

create table if not exists profiles
(
    id bigserial not null
        constraint profiles_pkey
            primary key
);

create table if not exists employments
(
    id bigserial not null
        constraint employments_pkey
            primary key,
    status employment_status not null,
    profile_id bigint not null
        constraint fk_rails_d95865cd58
            references profiles
                on delete cascade
);

create unique index if not exists index_employments_on_profile_id
    on employments (profile_id);

With these tables, I was asked to list all unemployed profiles. An unemployed profile is defined as a profile not having an employment record or having an employment with a status other than "approved".

My first tentative was the following query:

SELECT * FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE employments.status != 'approved'

The assumption here was that all profiles will be listed with their respective employments, then I could filter them with the where condition. Any profile without an employment record would have an employment status of null and therefore be filtered by the condition. However, this query did not return profiles without an employment.

After some research I found this post, explaining why it doesn't work and transformed my query:

SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';

Which actually did work. But, my ORM produced a slightly different query, which didn't work.

SELECT profiles.* FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'

The only difference being the select clause. I tried to understand why this slight difference produced such a difference an ran an explain analyze all three queries:

EXPLAIN ANALYZE SELECT * FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE employments.status != 'approved'

Hash Join  (cost=14.28..37.13 rows=846 width=452) (actual time=0.025..0.027 rows=2 loops=1)
  Hash Cond: (e.profile_id = profiles.id)
  ->  Seq Scan on employments e  (cost=0.00..20.62 rows=846 width=68) (actual time=0.008..0.009 rows=2 loops=1)
        Filter: (status <> ''approved''::employment_status)
        Rows Removed by Filter: 1
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.007..0.007 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.003..0.004 rows=8 loops=1)
Planning Time: 0.111 ms
Execution Time: 0.053 ms
EXPLAIN ANALYZE SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';

Hash Right Join  (cost=14.28..37.13 rows=846 width=452) (actual time=0.036..0.042 rows=8 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  ->  Seq Scan on employments  (cost=0.00..20.62 rows=846 width=68) (actual time=0.005..0.005 rows=2 loops=1)
        Filter: (status <> ''approved''::employment_status)
        Rows Removed by Filter: 1
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.015..0.015 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.010..0.011 rows=8 loops=1)
Planning Time: 0.106 ms
Execution Time: 0.108 ms
EXPLAIN ANALYZE SELECT profiles.* FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'

Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.006..0.007 rows=8 loops=1)
Planning Time: 0.063 ms
Execution Time: 0.016 ms

The first and second query plans are almost the same expect for the hash join for one and right hash join for the other, while the last query doesn't even do join or the where condition.

I came up with a forth query that did work:

EXPLAIN ANALYZE SELECT profiles.* FROM profiles 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE (employments.id IS NULL OR employments.status != 'approved')

Hash Right Join  (cost=14.28..35.02 rows=846 width=384) (actual time=0.021..0.026 rows=7 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  Filter: ((employments.id IS NULL) OR (employments.status <> ''approved''::employment_status))
  Rows Removed by Filter: 1
  ->  Seq Scan on employments  (cost=0.00..18.50 rows=850 width=20) (actual time=0.002..0.003 rows=3 loops=1)
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.011..0.011 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.007..0.008 rows=8 loops=1)
Planning Time: 0.104 ms
Execution Time: 0.049 ms

My questions about the subject is:

  1. Why are the query plans for second and third queries different even though they have the same structure?
  2. Why are the query plans the first and fourth queries different even though they the same structure?
  3. Why is Postgres totally ignoring my join and where condition for the third query?

EDIT:

With the following sample data, the expected query should return 2 and 3.

insert into profiles values (1);
insert into profiles values (2);
insert into profiles values (3);

insert into employments (profile_id, status) values (1, 'approved');
insert into employments (profile_id, status) values (2, 'denied');
moray95
  • 937
  • 2
  • 9
  • 12
  • Please in code questions give a [mre]--cut & paste & runnable code, including smallest representative example input as code; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. For SQL that includes DBMS & DDL (including constraints & indexes) & input as code formatted as a table. [ask] – philipxy Sep 29 '20 at 21:17
  • One query returns only columns from the left table, the other from both. There's no reason they would have the same plan, they ask for different things. The question is, why do you think your ORM should be asking for the same output rows as your by-hand query when it isn't? Given that it isn't, of course the plans are different. Presumably your by-hand query should simply also `select` `profiles.*` not `*`. – philipxy Sep 29 '20 at 21:23
  • @philipxy You are right, the query plan doesn't need to be exactly the same, but the result is very different because of it. I would expect the same result from the two queries with the only differences being the columns returns, which the query planner seems to ignore. – moray95 Sep 29 '20 at 21:27
  • I don't know what "very" means but results are different because you ask for different results. I don't know what "planner ignores" means.Under certain indexing, the left-table-only result won't access right table non-index data. (But you don't give a [mre].) Another standard comment of mine: Please research & summarize. For SQL that includes basics of optimization/performance--immediately leading to indexes, plans, statistics & SARGability. [Tips for asking a good SQL question](https://meta.stackoverflow.com/a/271056/3404097) Ask re optimization after you have learned & applied those basics. – philipxy Sep 29 '20 at 21:33
  • @philipxy I have added some sample data the expected result. – moray95 Sep 29 '20 at 21:38
  • I don't understand what you expect or why. Different result row requests will have different plans. Say what you expect & why. But also if you claim to expect something you should have justification per documentation. (And not just a personal guess at how planning/optimization/implementation works/acts. Because then saying you expect something is a misrepresenation--you don't have a reason to expect anything in particular.) PS What is"same structure" & what does it have to do with your expectations--with justification? – philipxy Sep 29 '20 at 21:39

1 Answers1

0

There must be a unique or primary key constraint on employments.profile_id (or it is a view with an appropriate DISTINCT clause) so that the optimizer knows that there can be at most one row in employments that is related to a given row in profiles.

If that is the case and you don't use employments's columns in the SELECT list, the optimizer deduces that the join is redundant and need not be calculated, which makes for a simpler and faster execution plan.

See the comment for join_is_removable in src/backend/optimizer/plan/analyzejoins.c:

/*
 * join_is_removable
 *    Check whether we need not perform this special join at all, because
 *    it will just duplicate its left input.
 *
 * This is true for a left join for which the join condition cannot match
 * more than one inner-side row.  (There are other possibly interesting
 * cases, but we don't have the infrastructure to prove them.)  We also
 * have to check that the inner side doesn't generate any variables needed
 * above the join.
 */
Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263