Eliminate duplicate rows in a PostgreSQL SELECT statement

Question

This is my query:

SELECT autor.entwickler,anwendung.name
  FROM autor 
  left join anwendung
    on anwendung.name = autor.anwendung;

 entwickler |    name     
------------+-------------
 Benutzer 1 | Anwendung 1
 Benutzer 2 | Anwendung 1
 Benutzer 2 | Anwendung 2
 Benutzer 1 | Anwendung 3
 Benutzer 1 | Anwendung 4
 Benutzer 2 | Anwendung 4
(6 rows)

I want to keep one row for each distinct value in the field name, and discard the others like this:

 entwickler |    name     
------------+-------------
 Benutzer 1 | Anwendung 1
 Benutzer 2 | Anwendung 2
 Benutzer 1 | Anwendung 3
 Benutzer 1 | Anwendung 4

In MySQL I would just do:

SELECT autor.entwickler,anwendung.name
  FROM autor
  left join anwendung
    on anwendung.name = autor.anwendung
 GROUP BY anwendung.name;

But PostgreSQL gives me this error:

ERROR: column "autor.entwickler" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT autor.entwickler FROM autor left join anwendung on an ...

I totally understand the error and assume that the mysql implementation is less SQL conform than the postgres implementation. But how can I get the desired result?

Your MySQL example works with in non-standard-SQL mode, and PostgreSQL use standard SQL... To compare, you must use `ONLY_FULL_GROUP_BY` mode in MySQL. Even in MySQL you need an aggregate sampler function (`ANY_VALUE` commented by Craig Ringer)... See also http://dba.stackexchange.com/a/133747/90651 — Peter Krauss, Mar 30 '16 at 03:08

Craig Ringer · Accepted Answer · 2014-10-03T02:01:07.697

PostgreSQL doesn't currently allow ambiguous GROUP BY statements where the results are dependent on the order the table is scanned, the plan used, etc. That's how the standard says it should work AFAIK, but some databases (like MySQL versions prior to 5.7) permit looser queries that just pick the first value encountered for elements appearing in the SELECT list but not in GROUP BY.

In PostgreSQL, you should use DISTINCT ON for this kind of query.

You want to write something like:

SELECT DISTINCT ON (anwendung.name) anwendung.name, autor.entwickler
FROM author 
left join anwendung on anwendung.name = autor.anwendung;

(Syntax corrected based on follow-up comment)

This is a bit like MySQL 5.7's ANY_VALUE(...) pseudo-function for group by, but in reverse - it says that the values in the distinct on clause must be unique, and any value is acceptable for the columns not specified.

Unless there's an ORDER BY, there is no gurantee as to which values are selected. You should usually have an ORDER BY for predictability.

It's also been noted that using an aggregate like min() or max() would work. While this is true - and will lead to reliable and predictable results, unlike using DISTINCT ON or an ambigious GROUP BY - it has a performance cost due to the need for extra sorting or aggregation, and it only works for ordinal data types.

thank you that put me on the right trac. the correct query loos like: SELECT DISTINCT ON (anwendung.name) anwendung.name,autor.entwickler FROM autor left join anwendung on anwendung.name = autor.anwendung ; — The Surrican, Dec 04 '11 at 14:07
The "DISTINCT ON" solution is more interesting in a sense. But I think "MIN" (or MAX) is a better canonical solution to this problem. — mdahlman, Dec 04 '11 at 15:05
I’m not sure whether `min()` or `max()` would have a much larger performance cost than `distinct`, given that `distinct` appears to sort the data anyway in order to find duplicates. If you `explain` the above query you’ll see there is a `Sort` node at the top of the tree. — jbg, Feb 12 '14 at 00:50

Erwin Brandstetter · Answer 2 · 2018-03-15T19:52:18.663

Craig's answer and your resulting query in the comments share the same flaw: The table anwendung is at the right side of a LEFT JOIN, which contradicts your obvious intent. You care about anwendung.name and pick autor.entwickler arbitrarily. I'll come back to that further down.

It should be:

SELECT DISTINCT ON (1) an.name, au.entwickler
FROM   anwendung an
LEFT   JOIN autor au ON an.name = au.anwendung;

_{DISTINCT ON (1) is just a syntactical shorthand for DISTINCT ON (an.name). Positional references are allowed here.}

If there are multiple developers (entwickler) for an app (anwendung) one developer is picked arbitrarily. You have to add an ORDER BY clause if you want the "first" (alphabetically according to your locale):

SELECT DISTINCT ON (1) an.name, au.entwickler
FROM   anwendung an
LEFT   JOIN autor au ON an.name = au.anwendung
ORDER  BY 1, 2;

As @mdahlman implied, a more canonical way would be:

SELECT an.name, min(au.entwickler) AS entwickler
FROM   autor au
LEFT   JOIN anwendung an ON an.name = au.anwendung
GROUP  BY an.name;

Or, better yet, clean up your data model, implement the n:m relationship between anwendung and autor properly, add surrogate primary keys as anwendung and autor are hardly unique, enforce relational integrity with foreign key constraints and adapt your resulting query:

The proper way

CREATE TABLE autor (
   autor_id serial PRIMARY KEY -- surrogate primary key
 , autor    text NOT NULL);

INSERT INTO autor  VALUES
   (1, 'mike')
 , (2, 'joe')
 , (3, 'jane')   -- worked on two apps
 , (4, 'susi');  -- has no part in any apps (yet)

CREATE TABLE anwendung (
   anwendung_id serial PRIMARY KEY -- surrogate primary key
 , anwendung    text  UNIQUE);     -- disallow duplicate names

INSERT INTO anwendung  VALUES
   (1, 'foo')    -- has 3 authors linked to it
 , (2, 'bar')
 , (3, 'shark')
 , (4, 'bait');  -- has no authors attached to it (yet).

CREATE TABLE autor_anwendung (  -- you might name this table "entwickler"
   autor_id     integer REFERENCES autor     ON UPDATE CASCADE ON DELETE CASCADE
 , anwendung_id integer REFERENCES anwendung ON UPDATE CASCADE ON DELETE CASCADE
 , PRIMARY KEY (autor_id, anwendung_id)
);

INSERT INTO autor_anwendung VALUES
 (1, 1)
,(2, 1)
,(3, 1)
,(2, 2)
,(3, 3);

This query retrieves one row per app with one associated author (the 1st one alphabetically) or NULL if there are none:

SELECT DISTINCT ON (1) an.anwendung, au.autor
FROM   anwendung an
LEFT   JOIN autor_anwendung au_au USING (anwendung_id)
LEFT   JOIN autor au USING (autor_id)
ORDER  BY 1, 2;

Result:

 name  | entwickler
-------+-----------------
 bait  |
 bar   | joe
 foo   | jane
 shark | jane

Interessting note on the `DISTINC ON (1)` construct. Never seen this before. — DrColossos, Dec 05 '11 at 06:47

Eliminate duplicate rows in a PostgreSQL SELECT statement

2 Answers2

The proper way

Linked