Left outer join acting like inner join

Question

Summary

My goal is to find every user who has ever been assigned to a task, and then generate some statistics over a particular date range, and associate the stats with the original set of users. When no statistics exist for a particular user, I want a row in the output for the user but NULL values for the stats.

I have a complex SQL query that looks like this (actual query at the bottom):

SELECT
  user_name, changeday, project_name
  sum(hour_delta) AS hours,
FROM ( … ) tasked_users
LEFT OUTER JOIN ( … ) a
ON tasked_users.id=a.assignee_id
WHERE
  (changeday IS NULL) OR (changeday >= … AND changeday <= …)
GROUP BY user_name, changeday, a.project_name
ORDER BY user_name, changeday, a.project_name;

My desire is to find a large set of users and match them up with data from the a table; when there exist users that don't have any matching entries in a I want nulls or 0 hours.

Unfortunately, this query returns only rows for users present in 'a'. For example, one particular set of dates returns:

{:user_name=>"Gavin", :hours=>0.0, :changeday=>2013-09-08, :project_name=>"Foo"}
{:user_name=>"Steve", :hours=>1.0, :changeday=>2013-09-08, :project_name=>"Bar"}

While different date ranges result in different users being found. The contents of the tasked_users sub-query has 14 distinct user id/name pairs. I need all of them to be represented in the result.

Example query

In case it makes a difference, or in case you have additional helpful tips on improving the query, here's the full query.

SELECT
  user_name,
  sum(hour_delta) AS hours,
  changeday,
  project_name
FROM (
  SELECT DISTINCT
    users.id,
    users.name AS user_name
  FROM users
  INNER JOIN tasks AS tasks1
  ON users.id=tasks1.assignee_id
) tasked_users
LEFT OUTER JOIN
(
  SELECT
    (
      coalesce(cast(nullif(new_value,'') AS float),0) -
      coalesce(cast(nullif(old_value,'') AS float),0)
    ) AS hour_delta,
    task_id,
    tasks2.assignee_id AS assigned_log,
    fixin_id,
    projects.name AS project_name,
    date_trunc('day',task_log_entries.created_on) AS changeday
  FROM task_log_entries
  INNER JOIN tasks AS tasks2
  ON task_id=tasks2.id
  INNER JOIN fixins
  ON fixins.id=tasks2.fixin_id
  INNER JOIN projects
  ON projects.id=fixins.project_id
  WHERE field_id=18
) a
ON tasked_users.id=a.assigned_log
WHERE
  (changeday IS NULL)
  OR
  (changeday >= '2013-09-08' AND changeday <= '2013-09-08')
GROUP BY user_name, changeday, a.project_name
ORDER BY user_name, changeday, a.project_name;

Explain output

Here's the result of EXPLAIN for the query, in case it helps (I don't know how to read this and derive what I need):

GroupAggregate  (cost=1116.40..1116.99 rows=13 width=144)"}
  ->  Sort  (cost=1116.40..1116.43 rows=13 width=144)"}
        Sort Key: users.name, (date_trunc('day'::text, task_log_entries.created_on)), projects.name"}
        ->  Hash Left Join  (cost=1024.32..1116.16 rows=13 width=144)"}
              Hash Cond: (users.id = tasks2.assignee_id)"}
              Filter: ((date_trunc('day'::text, task_log_entries.created_on) IS NULL) OR ((date_trunc('day'::text, task_log_entries.created_on) >= '2013-09-08 00:00:00'::timestamp without time zone) AND (date_trunc('day'::text, task_log_entries.created_on) <= '2013-09-08 00:00:00'::timestamp without time zone)))"}
              ->  HashAggregate  (cost=44.07..45.46 rows=139 width=12)"}
                    ->  Hash Join  (cost=5.13..40.09 rows=795 width=12)"}
                          Hash Cond: (tasks1.assignee_id = users.id)"}
                          ->  Seq Scan on tasks tasks1  (cost=0.00..24.01 rows=801 width=4)"}
                          ->  Hash  (cost=3.39..3.39 rows=139 width=12)"}
                                ->  Seq Scan on users  (cost=0.00..3.39 rows=139 width=12)"}
              ->  Hash  (cost=963.51..963.51 rows=1339 width=30)"}
                    ->  Hash Join  (cost=729.23..963.51 rows=1339 width=30)"}
                          Hash Cond: (fixins.project_id = projects.id)"}
                          ->  Hash Join  (cost=727.91..943.79 rows=1339 width=24)"}
                                Hash Cond: (task_log_entries.task_id = tasks2.id)"}
                                ->  Seq Scan on task_log_entries  (cost=0.00..197.46 rows=1339 width=20)"}
                                      Filter: (field_id = 18)"}
                                ->  Hash  (cost=717.90..717.90 rows=801 width=12)"}
                                      ->  Hash Join  (cost=676.87..717.90 rows=801 width=12)"}
                                            Hash Cond: (tasks2.fixin_id = fixins.id)"}
                                            ->  Seq Scan on tasks tasks2  (cost=0.00..24.01 rows=801 width=12)"}
                                            ->  Hash  (cost=589.72..589.72 rows=6972 width=8)"}
                                                  ->  Seq Scan on fixins  (cost=0.00..589.72 rows=6972 width=8)"}
                          ->  Hash  (cost=1.14..1.14 rows=14 width=14)"}
                                ->  Seq Scan on projects  (cost=0.00..1.14 rows=14 width=14)"}

Table definitions

Here's the description of all tables involved. I have not trimmed them to remove any "irrelevant" columns so you can be sure to see if there are any ambiguous column name conflicts.

app=> \d task_log_entries
                                     Table "public.task_log_entries"
   Column   |            Type             |                           Modifiers
------------+-----------------------------+---------------------------------------------------------------
 id         | integer                     | not null default nextval('task_log_entries_id_seq'::regclass)
 task_id    | integer                     | not null
 user_id    | integer                     |
 field_id   | integer                     | not null
 created_on | timestamp without time zone | not null default now()
 new_value  | text                        |
 old_value  | text                        |
Indexes:
    "task_log_entries_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "task_log_entries_field_id_fkey" FOREIGN KEY (field_id) REFERENCES log_fields(id)
    "task_log_entries_task_id_fkey" FOREIGN KEY (task_id) REFERENCES tasks(id) ON DELETE CASCADE
    "task_log_entries_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL


app=> \d tasks
                                        Table "public.tasks"
     Column     |            Type             |                      Modifiers
----------------+-----------------------------+-----------------------------------------------------
 id             | integer                     | not null default nextval('fixins_id_seq'::regclass)
 fixin_id       | integer                     | not null
 created_on     | timestamp without time zone | not null default now()
 updated_on     | timestamp without time zone | not null default now()
 name           | character varying(200)      | not null
 description    | text                        |
 blocked_by     | character varying(200)      |
 estimate       | double precision            |
 actual         | double precision            |
 remaining      | double precision            |
 relative_order | integer                     |
 status_id      | integer                     | not null
 assignee_id    | integer                     |
Indexes:
    "tasks_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "tasks_assignee_id_fkey" FOREIGN KEY (assignee_id) REFERENCES users(id) ON DELETE SET NULL
    "tasks_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE CASCADE
    "tasks_status_id_fkey" FOREIGN KEY (status_id) REFERENCES task_statuses(id)
Referenced by:
    TABLE "task_comments" CONSTRAINT "task_comments_task_id_fkey" FOREIGN KEY (task_id) REFERENCES tasks(id) ON DELETE CASCADE
    TABLE "task_log_entries" CONSTRAINT "task_log_entries_task_id_fkey" FOREIGN KEY (task_id) REFERENCES tasks(id) ON DELETE CASCADE
    TABLE "users_tasks_notifications" CONSTRAINT "users_tasks_notifications_task_id_fkey" FOREIGN KEY (task_id) REFERENCES tasks(id) ON DELETE CASCADE


app=> \d fixins
                                       Table "public.fixins"
     Column     |            Type             |                      Modifiers
----------------+-----------------------------+-----------------------------------------------------
 id             | integer                     | not null default nextval('fixins_id_seq'::regclass)
 project_id     | integer                     | not null
 created_on     | timestamp without time zone | not null default now()
 updated_on     | timestamp without time zone | not null default now()
 name           | character varying(200)      | not null
 description    | text                        | not null
 status_id      | integer                     | not null
 reporter_id    | integer                     |
 assignee_id    | integer                     |
 priority_id    | integer                     | not null
 severity_id    | integer                     | not null
 likelihood_id  | integer                     | not null
 maturity       | integer                     | not null default 0
 version        | character varying(100)      |
 iteration_id   | integer                     |
 relative_order | integer                     |
 kind           | character varying(16)       | not null default 'Bug'::character varying
 specs          | character varying(50)       |
 estimate       | double precision            |
 blocked_by     | character varying(200)      |
 plan_estimate  | double precision            |
 actual         | double precision            |
 remaining      | double precision            |
 promise_date   | date                        |
Indexes:
    "fixins_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fixins_assignee_id_fkey" FOREIGN KEY (assignee_id) REFERENCES users(id) ON DELETE SET NULL
    "fixins_iteration_id_fkey" FOREIGN KEY (iteration_id) REFERENCES iterations(id) ON DELETE SET NULL
    "fixins_likelihood_id_fkey" FOREIGN KEY (likelihood_id) REFERENCES likelihoods(id)
    "fixins_priority_id_fkey" FOREIGN KEY (priority_id) REFERENCES priorities(id)
    "fixins_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)
    "fixins_reporter_id_fkey" FOREIGN KEY (reporter_id) REFERENCES users(id) ON DELETE SET NULL
    "fixins_severity_id_fkey" FOREIGN KEY (severity_id) REFERENCES severities(id)
    "fixins_status_id_fkey" FOREIGN KEY (status_id) REFERENCES statuses(id)
Referenced by:
    TABLE "bug_snapshots" CONSTRAINT "bug_snapshots_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE SET NULL
    TABLE "comments" CONSTRAINT "comments_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE CASCADE
    TABLE "customers_fixins" CONSTRAINT "customers_fixins_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id)
    TABLE "fixins_tags" CONSTRAINT "fixins_tags_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE CASCADE
    TABLE "log_entries" CONSTRAINT "log_entries_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE CASCADE
    TABLE "relationships" CONSTRAINT "relationships_fixin1_id_fkey" FOREIGN KEY (fixin1_id) REFERENCES fixins(id) ON DELETE CASCADE
    TABLE "relationships" CONSTRAINT "relationships_fixin2_id_fkey" FOREIGN KEY (fixin2_id) REFERENCES fixins(id) ON DELETE CASCADE
    TABLE "tasks" CONSTRAINT "tasks_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE CASCADE
    TABLE "users_notifications" CONSTRAINT "users_notifications_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id) ON DELETE CASCADE
    TABLE "votes" CONSTRAINT "votes_fixin_id_fkey" FOREIGN KEY (fixin_id) REFERENCES fixins(id)


app=> \d projects
                                     Table "public.projects"
     Column     |          Type           |                       Modifiers
----------------+-------------------------+-------------------------------------------------------
 id             | integer                 | not null default nextval('projects_id_seq'::regclass)
 name           | character varying(50)   | not null
 link_name      | character varying(50)   | not null
 pain_threshold | integer                 | not null
 wiki_server    | character varying(100)  |
 wiki_wiki      | character varying(100)  |
 wiki_pattern   | character varying(1000) |
 active         | boolean                 | not null default true
Indexes:
    "projects_pkey" PRIMARY KEY, btree (id)
    "projects_link_name_key" UNIQUE, btree (link_name)
Referenced by:
    TABLE "fixins" CONSTRAINT "fixins_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)
    TABLE "iterations" CONSTRAINT "iterations_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id) ON DELETE CASCADE
    TABLE "project_preferences" CONSTRAINT "project_preferences_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)
    TABLE "projects_users_notifications" CONSTRAINT "projects_users_notifications_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)
    TABLE "releases" CONSTRAINT "releases_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id) ON DELETE CASCADE


app=> \d users
                                 Table "public.users"
  Column  |         Type          |                     Modifiers
----------+-----------------------+----------------------------------------------------
 id       | integer               | not null default nextval('users_id_seq'::regclass)
 name     | character varying(50) | not null
 email    | character varying(50) |
 active   | boolean               | not null default true
 passhash | character varying(40) |
 salt     | character varying(4)  |
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "comments" CONSTRAINT "comments_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL
    TABLE "fixins" CONSTRAINT "fixins_assignee_id_fkey" FOREIGN KEY (assignee_id) REFERENCES users(id) ON DELETE SET NULL
    TABLE "fixins" CONSTRAINT "fixins_reporter_id_fkey" FOREIGN KEY (reporter_id) REFERENCES users(id) ON DELETE SET NULL
    TABLE "log_entries" CONSTRAINT "log_entries_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL
    TABLE "project_preferences" CONSTRAINT "project_preferences_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
    TABLE "projects_users_notifications" CONSTRAINT "projects_users_notifications_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
    TABLE "task_comments" CONSTRAINT "task_comments_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL
    TABLE "task_log_entries" CONSTRAINT "task_log_entries_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL
    TABLE "tasks" CONSTRAINT "tasks_assignee_id_fkey" FOREIGN KEY (assignee_id) REFERENCES users(id) ON DELETE SET NULL
    TABLE "users_notifications" CONSTRAINT "users_notifications_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
    TABLE "users_tasks_notifications" CONSTRAINT "users_tasks_notifications_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE


app=> \d log_fields
          Table "public.log_fields"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                | not null
 name   | character varying(200) | not null
Indexes:
    "log_fields_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "log_entries" CONSTRAINT "log_entries_field_id_fkey" FOREIGN KEY (field_id) REFERENCES log_fields(id)
    TABLE "task_log_entries" CONSTRAINT "task_log_entries_field_id_fkey" FOREIGN KEY (field_id) REFERENCES log_fields(id)

A bit of a sanity check (I'm pulling at straws here). What happens if you remove your `GROUP` and `ORDER` by clauses (of course, removing the aggregate `SUM` too)? — slugonamission, Sep 23 '13 at 16:59
@slugonamission Good question! Removing those clauses and the `sum` shows rows only relating to two users (for that date range). — Phrogz, Sep 23 '13 at 17:09
My next guess would be to check that the query without the `JOIN` returns what you expect, but I think you've already done that. I'd have a look at using `EXPLAIN ...` to try and figure it out, but aside from that, I'm out of ideas. Good luck! — slugonamission, Sep 23 '13 at 17:20
Never used PostgreSQL before, but does the `date_trunc` function return `NULL` when given a null date? — randcd, Sep 23 '13 at 17:24
What if you remove the `WHERE` Clause? Do all 14 show up? Building on my last comment. — randcd, Sep 23 '13 at 17:35
Fully qualify your field names with the proper aliases. That would make it readable. (and probably make the problem come to the surface) — joop, Sep 23 '13 at 17:40
@randcd No, removing the `WHERE` clauses still only shows rows with the two users, not all 14. As for [`date_trunc`](http://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC), I believe that it does, but it doesn't matter for me: every row has a non-null `created_on` column. — Phrogz, Sep 23 '13 at 17:47
@joop Every field that I select in the end is aliased, and most others are as well; could you be more specific in what you mean? — Phrogz, Sep 23 '13 at 17:48
`SELECT user_name, sum(hour_delta) AS hours, changeday, project_name FROM (.....) ....` None of them is alias qualified. Makes it very hard to determine where they stem from. The same in the various subselects, and WHERE and ON clauses. — joop, Sep 23 '13 at 18:16
Please add table qualification to `WHERE field_id=18` to make it unambiguous. Or better yet, to *all* column names in the query - even more since you did not provide your table definitions. — Erwin Brandstetter, Sep 23 '13 at 18:24
@ErwinBrandstetter I have added table definitions at the bottom of the question. In this case `field_id` happens to be a unique column name in the query. — Phrogz, Sep 23 '13 at 19:12
"unambiguous" in my comment was meant for us on SO, not for Postgres - which knows which table the column belongs to. — Erwin Brandstetter, Sep 23 '13 at 19:16

Erwin Brandstetter · Accepted Answer · 2018-11-27T11:45:45.193

The query can probably be simplified to:

SELECT u.name AS user_name
     , p.name AS project_name
     , tl.created_on::date AS changeday
     , coalesce(sum(nullif(new_value, '')::numeric), 0)
     - coalesce(sum(nullif(old_value, '')::numeric), 0) AS hours
FROM   users             u
LEFT   JOIN (
        tasks            t 
   JOIN fixins           f  ON  f.id = t.fixin_id
   JOIN projects         p  ON  p.id = f.project_id
   JOIN task_log_entries tl ON  tl.task_id = t.id
                           AND  tl.field_id = 18
                           AND (tl.created_on IS NULL OR
                                tl.created_on >= '2013-09-08' AND
                                tl.created_on <  '2013-09-09') -- upper border!
       ) ON t.assignee_id = u.id
WHERE  EXISTS (SELECT 1 FROM tasks t1 WHERE t1.assignee_id = u.id)
GROUP  BY 1, 2, 3
ORDER  BY 1, 2, 3;

This returns all users that have ever had any task.
Plus data per projects and day where data exists in the specified date range in task_log_entries.

Major points

The aggregate function sum() ignores NULL values. COALESCE() per row is not required any more as soon as you recast the calculation as the difference of two sums:
```
 ,coalesce(sum(nullif(new_value, '')::numeric), 0) -
  coalesce(sum(nullif(old_value, '')::numeric), 0) AS hours
```
However, if it is possible that all columns of a selection have NULL or empty strings, wrap the sums into COALESCE once.
I am using numeric instead of float, safer alternative to minimize rounding errors.
Your attempt to get distinct values from the join of users and tasks is futile, since you join to task once more further down. Flatten the whole query to make it simpler and faster.
These positional references are just a notational convenience:
```
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
```
... doing the same as in your original query.
To get a date from a timestamp you can simply cast to date:
```
tl.created_on::date AS changeday
```
But it's much better to test with original values in the WHERE clause or JOIN condition (if possible, and it is possible here), so Postgres can use plain indices on the column (if available):
```
 AND (tl.created_on IS NULL OR
      tl.created_on >= '2013-09-08' AND
      tl.created_on <  '2013-09-09')  -- next day as excluded upper border
```
Note that a date literal is converted to a timestamp at 00:00 of the day at your current time zone. You need to pick the next day and exclude it as upper border. Or provide a more explicit timestamp literal like '2013-09-22 0:0 +2':: timestamptz. More on excluding upper border:
- Calculate number of concurrent events in SQL
- Find overlapping date ranges in PostgreSQL
For the requirement every user who has ever been assigned to a task add the WHERE clause:
```
WHERE EXISTS (SELECT 1 FROM tasks t1 WHERE t1.assignee_id = u.id)
```
Most importantly: A LEFT [OUTER] JOIN preserves all rows to the left of the join. Adding a WHERE clause on the right table can void this effect. Instead, move the filter expression to the JOIN clause. More explanation here:
- Query with LEFT JOIN not returning rows for count of 0
Parentheses can be used to force the order in which tables are joined. Rarely needed for simple queries, but very useful in this case. I use the feature to join task, fixins, projects and task_log_entries before left-joining all of it to users - without subquery.
Table aliases make writing complex queries easier.

+1 These are great tips—thank you!—but they don't give the results that I want. I don't want my two task joins to be futile, and I suppose that is the problem. I want to find "all users that have ever had a task assigned to them" and left join that with "all task log entries in a particular time frame" (which I then join with some other tables on order to find the user associated with the task, and ultimately the name of the project that the task is part of). I assumed that this was possible with a single query; was I wrong? I will edit the answer presently with schema of the tables involved. — Phrogz, Sep 23 '13 at 18:59
No glory. With your updated query I get one row for every `user_name` (yay!) but only `NULL` values for all other columns (oops). — Phrogz, Sep 23 '13 at 19:16
Nope; still gives nulls for all columns other than `user_name`. (FWIW every task will have a `fixin_id` and every fixin will have a `project_id`.) — Phrogz, Sep 23 '13 at 19:22
Ah; for some reason I need to modify the second date to be one day in the future with yours (some sort of fencepost problem with the way timestamps are being coerced to dates?). It looks like this is working with that adjustment. I'll accept the answer after further testing. — Phrogz, Sep 23 '13 at 19:34
@Phrogz: I added some more hints, especially concerning the upper border of the date range. — Erwin Brandstetter, Sep 23 '13 at 19:48

score 1 · Answer 2 · answered Sep 23 '13 at 17:24

1

It doesn't work because the first query is inner joined with tasks. The same table is than used to perform outer join (through subquery but nevertheless) but the first query (tasked users) doesn't have the relevant records in the first place (that lack the match).

Try using

....
FROM (
  SELECT DISTINCT
    users.id,
    users.name AS user_name
  FROM users    
) tasked_users
...

answered Sep 23 '13 at 17:24

nimdil

1,361
10
20

This looks like the problem, +1. When I drop that `INNER JOIN` from the sub-select it works as expected. So, how do I get the end result that I want, where I don't select ALL users, but only the users that have been tasked in the past? – Phrogz Sep 23 '13 at 17:44
But if dropping INNER JOIN from the first subquery solves the problem, what is the remaining problem? – nimdil Sep 23 '13 at 19:17
The problem is that the result includes rows for ALL users (some hundreds) not just rows for all users that have been given a task in the past. – Phrogz Sep 23 '13 at 19:21