GROUP or DISTINCT after JOIN returns duplicates

Question

I have two tables, products and meta. They are in relation 1:N where each product row has at least one meta row via foreign key.

(viz. SQLfiddle: http://sqlfiddle.com/#!15/c8f34/1)

I need to join these two tables but i need to filter only unique products. When I try this query, everything is ok (4 rows returned):

SELECT DISTINCT(product_id)
FROM meta JOIN products ON products.id = meta.product_id

but when I try to select all columns the DISTINCT rule no longer applies to results, as 8 rows instead of 4 is returned.

SELECT DISTINCT(product_id), *
FROM meta JOIN products ON products.id = meta.product_id

I have tried many approaches like trying to DISTINCT or GROUP BY on sub-query but always with same result.

Which 4 from the 8 rows in META would you like to return? Any? What is the criteria? — Clodoaldo Neto, Aug 25 '14 at 13:31
I mean, product_id #1 has 3 corresponding rows in meta. Which one? — Clodoaldo Neto, Aug 25 '14 at 13:33
Oops, i forgot to add it, i need for example the one meta with latest date in my real case but there it could be just only the meta with highest id. — OKE, Aug 25 '14 at 13:39

score 8 · Accepted Answer · edited Jun 20 '20 at 09:12

8

While retrieving all or most rows from a table, the fastest way for this type of query typically is to aggregate / disambiguate first and join later:

SELECT *
FROM   products p
JOIN  (
   SELECT DISTINCT ON (product_id) *
   FROM   meta
   ORDER  BY product_id, id DESC
   ) m ON m.product_id = p.id;

The more rows in meta per row in products, the bigger the impact on performance.

Of course, you'll want to add an ORDER BY clause in the subquery do define which row to pick form each set in the subquery. @Craig and @Clodoaldo already told you about that. I am returning the meta row with the highest id.

SQL Fiddle.

Details for DISTINCT ON:

Select first row in each GROUP BY group?

Optimize performance

Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:

SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
FROM  (
   SELECT product_id, max(id) AS meta_id
   FROM   meta
   GROUP  BY 1
   ) sub
JOIN meta     m ON m.id = sub.meta_id
JOIN products p ON p.id = sub.product_id;

If you wouldn't use the non-descriptive id as column names, we would not run into naming collisions and could simply write SELECT p.*, m.*. (I never use id as column name.)

If performance is your paramount requirement, consider more options:

a MATERIALIZED VIEW with pre-aggregated data from meta, if your data does not change (much).
a recursive CTE emulating a loose index scan for a big meta table with many rows per product (relatively few distinct product_id).
This is the only way I know to use an index for a DISTINCT query over the whole table.

edited Jun 20 '20 at 09:12

Community

1
1

answered Aug 25 '14 at 14:14

Erwin Brandstetter

605,456
145
1,078
1,228

1

Since the `distinct` is `on` a META column I would guess the planner is smart enough to do it first by himself. Am I unreazonable optimistic? – Clodoaldo Neto Aug 25 '14 at 14:19
1

Answering my own question after checking the `explain` output. No it is not smart enough – Clodoaldo Neto Aug 25 '14 at 14:27
@ClodoaldoNeto: Yep, that's what my tests keep showing, too - which is the sole reason for my added answer. – Erwin Brandstetter Aug 25 '14 at 14:28
Your output is incomplete. The meta.id column is missing as it would have a duplicate name. Better to change to 'SELECT p.*, meta_id, m.* FROM....'. – youngthing Aug 25 '14 at 15:22
@youngthing: `meta_id` is the same as `m.id` and would be redundant. – Erwin Brandstetter Aug 25 '14 at 15:25
@ErwinBrandstetter: have you tested it? http://sqlfiddle.com/#!15/c8f34/48 suggests otherwise.. – youngthing Aug 25 '14 at 15:36
@youngthing: That's because the name `id` is in the result twice and JDBC (which sqlfiddle is based on) swallows the second (bug!). The same does not happen in my tests with psql or pgAdmin. Wouldn't happen if we wouldn't use the non-descriptive `id` as column names (I ***never*** do this nonsense, but half-wit ORMs do.) Consider: http://sqlfiddle.com/#!15/0a77d/3 – Erwin Brandstetter Aug 25 '14 at 15:56
@ErwinBrandstetter: agreed, looks like a bug IMHO in JDBC. PostgreSQL produces two columns with the name 'id'. – youngthing Aug 25 '14 at 16:00
Thank you (and others with same solution) but your answer is most explicit and i like both correct/optimized answers. And also, it is working as expected. I never heard of DISTINCT ON before, as I was using the MySQL and migrated to PgSQL recently. And as both my develop, test and production enviroments run PgSQL 9.* it's ok. – OKE Aug 26 '14 at 05:58

score 5 · Answer 2 · answered Aug 25 '14 at 13:42

I think you might be looking for DISTINCT ON, a PostgreSQL extension feature:

SELECT 
  DISTINCT ON(product_id)
  * 
FROM meta 
INNER JOIN products ON products.id = meta.product_id;

http://sqlfiddle.com/#!15/c8f34/18

However, note that without an ORDER BY the results are not guaranteed to be consistent; the database can pick any row it wants from the matching rows.

score 2 · Answer 3 · edited May 23 '17 at 12:10

2

Use distinct on as suggested by @Craig's answer but combined with the order by clause as explicated in the comments. SQL Fiddle

select distinct on(m.product_id) * 
from
    meta m
    inner join
    products p on p.id = m.product_id
order by m.product_id, m.id desc;

edited May 23 '17 at 12:10

Community

1
1

answered Aug 25 '14 at 13:49

Clodoaldo Neto

118,695
26
233
260

youngthing · Answer 4 · 2014-08-25T14:38:16.453

0

You can use a subquery to identify the max(ID) for each product, then use that in the superquery to gather the details you want to display:

SELECT q.product_id, meta.* from
(SELECT product_id, max(meta.ID)
 FROM meta JOIN products ON products.id=meta.product_id 
 GROUP BY product_id) q 
JOIN meta ON q.max=meta.id;

It is not the only solution!

A quick comparison to use of DISTINCT ON solutions suggests that it is slower (http://sqlfiddle.com/#!15/c8f34/38). It avoids a full sort on ID and prefers a sequential scan.

edited Aug 25 '14 at 14:38

answered Aug 25 '14 at 13:39

youngthing

384
2
13

Edited to add a comparison to a "DISTINCT ON" strategy. – youngthing Aug 25 '14 at 14:10
1

Your query exhibits two weak spots. 1. Does not return all columns from `products` like the question. 2. Slower than necessary. Aggregate *first*, join to products *later* (only good if most or all of `meta` is involved in the query). I added code to my answer, plus more on performance. – Erwin Brandstetter Aug 25 '14 at 14:58
True. I really like your solution, though I think it needs one adjustment to the output. – youngthing Aug 25 '14 at 15:21

GROUP or DISTINCT after JOIN returns duplicates

4 Answers4

Optimize performance

Linked