0

Here is a simplified description of 2 tables:

CREATE TABLE jobs(id PRIMARY KEY, description);
CREATE TABLE dates(id PRIMARY KEY, job REFERENCES jobs(id), date);

There may be one or more dates per job.

I would like create a query which generates the following (in pidgin):

jobs.id, jobs.description, min(dates.date) as start, max(dates.date) as finish

I have tried something like this:

SELECT id, description,
      (SELECT min(date) as start  FROM dates d WHERE d.job=j.id),
      (SELECT max(date) as finish FROM dates d WHERE d.job=j.id)
FROM jobs j;

which works, but looks very inefficient.

I have tried an INNER JOIN, but can’t see how to join jobs with a suitable aggregate query on dates.

Can anybody suggest a clean efficient way to do this?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Manngo
  • 14,066
  • 10
  • 88
  • 110
  • You can aggregate already joined "table". Most likely will use `GROUP BY`. – PM 77-1 Jan 25 '16 at 00:37
  • @PM77-1 Something like this: `SELECT min(jobs.id),min(jobs.description), min(dates.date) AS start, max(dates.data) AS finish FROM obs INNER JOIN dates ON jobs.id=dates.job GROUP BY jobs.id`. I thought of that, but wondered whether use of `min` for the `job` items was wasteful. – Manngo Jan 25 '16 at 00:43
  • You do not have to aggregate fields you use for grouping. – PM 77-1 Jan 25 '16 at 01:03
  • @PM77-1 You’re right, of course. The corrected statement would be: `SELECT jobs.id,min(jobs.description), min(dates.date) AS start, max(dates.data) AS finish FROM obs INNER JOIN dates ON jobs.id=dates.job GROUP BY jobs.id`. However, in my real application, there are additional fields from the `jobs` table which would also need to be listed, similar to the `description` above. – Manngo Jan 25 '16 at 01:20
  • As always, your Postgres version should be declared. – Erwin Brandstetter Jan 25 '16 at 03:28

2 Answers2

1

While retrieving all rows: aggregate first, join later:

SELECT id, j.description, d.start, d.finish
FROM   jobs j
LEFT   JOIN (
   SELECT job AS id, min(date) AS start, max(date) AS finish 
   FROM   dates 
   GROUP  BY job
   ) d USING (id);

Related:

About JOIN .. USING

It's not a "different type of join". USING (col) is a standard SQL (!) syntax shortcut for ON a.col = b.col. More precisely, quoting the manual:

The USING clause is a shorthand that allows you to take advantage of the specific situation where both sides of the join use the same name for the joining column(s). It takes a comma-separated list of the shared column names and forms a join condition that includes an equality comparison for each one. For example, joining T1 and T2 with USING (a, b) produces the join condition ON *T1*.a = *T2*.a AND *T1*.b = *T2*.b.

Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.

It's particularly convenient that you can write SELECT * FROM ... and joining columns are only listed once.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    I haven’t seen this type of join before. What is `USING` and why won’t `ON d.job=j.id) work? – Manngo Jan 25 '16 at 01:25
  • @Manngo: Consider the added explanation. `ON d.job=j.id` does not work here, because there is no column `job` in the derived table `d` after I have applied the column alias `id` (with `job AS id`). Either remove the alias or use instead: `ON d.id = j.id` if you can't acquaint yourself with the convenient shortcut `USING (id)`. – Erwin Brandstetter Jan 25 '16 at 03:34
  • Well, I learned two things for the price of one here. Aggregate before the join, and the USING clause. It also seems that you can get away with aliasing a column to the same as a column name which allows you to include such weirdness as `SELECT id, job AS id FROM dates` which is probably totally useless, but which would help explain why you can’t use an alias in a calculated expression. Thanks for your excellent solution. – Manngo Jan 27 '16 at 05:35
1

In addition to Erwin's solution, you can also use a window clause:

SELECT j.id, j.description,
       first_value(d.date) OVER w AS start,
       last_value(d.date) OVER w AS finish
FROM jobs j
JOIN dates d ON d.job = j.id
WINDOW w AS (PARTITION BY j.id ORDER BY d.date
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

Window functions effectively group by one or more columns (the PARTITION BY clause) and/or ORDER BY some other columns and then you can apply some window function to it, or even a regular aggregate function, without affecting grouping or ordering of any other columns (description in your case). It requires a somewhat different way of constructing queries, but once you get the idea it is pretty brilliant.

In your case you need to get the first value of a partition, which is easy because it is accessible by default. You also need to look beyond the window frame (which ends by default with the current row) to the last value in the partition and then you need the ROWS clause. Since you produce two columns using the same window definition, the WINDOW clause is used here; in case it applies to a single column you can just write the window function in the select list followed by the OVER clause and the window definition without its name (WINDOW w AS (...)).

Community
  • 1
  • 1
Patrick
  • 29,357
  • 6
  • 62
  • 90
  • This goes beyond my own experience with window functions (limited to subtotals and getting row numbers). Thanks for this — I’ll need to spend some time learning more about this. – Manngo Jan 26 '16 at 23:47