5

I'm trying to submit a query in Postgres that only returns distinct tuples. In my sample query, I do not want duplicate entries where an entry exists multiple times for a cluster_id/feed_id combination. If I do a simple:

select distinct on (cluster_info.cluster_id, feed_id) 
   cluster_info.cluster_id, num_docs, feed_id, url_time 
   from url_info 
   join cluster_info on (cluster_info.cluster_id = url_info.cluster_id) 
   where feed_id in (select pot_seeder from potentials) 
   and num_docs > 5 and url_time > '2012-04-16';

I get just that, but I'd also like to group according to num_docs. So, when I do the following:

select distinct on (cluster_info.cluster_id, feed_id) 
   cluster_info.cluster_id, num_docs, feed_id, url_time 
   from url_info join cluster_info 
   on (cluster_info.cluster_id = url_info.cluster_id) 
   where feed_id in (select pot_seeder from potentials) 
   and num_docs > 5 and url_time > '2012-04-16' 
   order by num_docs desc;

I get the following error:

ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: select distinct on (cluster_info.cluster_id, feed_id) cluste...

I think I understand why I'm getting the error (cannot group by tuples unless I explicitly describe the group somehow) but how do I do that? Or if I am incorrect in my interpretation of the error, is there a way to accomplish my initial goal?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
WildBill
  • 9,143
  • 15
  • 63
  • 87

1 Answers1

11

The leftmost ORDER BY items cannot disagree with the items of the DISTINCT clause. I quote the manual about DISTINCT:

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.

Try:

SELECT *
FROM  (
    SELECT DISTINCT ON (c.cluster_id, feed_id) 
           c.cluster_id, num_docs, feed_id, url_time 
    FROM   url_info u
    JOIN   cluster_info c ON (c.cluster_id = u.cluster_id) 
    WHERE  feed_id IN (SELECT pot_seeder FROM potentials) 
    AND    num_docs > 5
    AND    url_time > '2012-04-16'
    ORDER  BY c.cluster_id, feed_id, num_docs, url_time
           -- first columns match DISTINCT
           -- the rest to pick certain values for dupes
           -- or did you want to pick random values for dupes?
    ) x
ORDER  BY num_docs DESC;

Or use GROUP BY:

SELECT c.cluster_id
     , num_docs
     , feed_id
     , url_time 
FROM   url_info u
JOIN   cluster_info c ON (c.cluster_id = u.cluster_id) 
WHERE  feed_id IN (SELECT pot_seeder FROM potentials) 
AND    num_docs > 5
AND    url_time > '2012-04-16'
GROUP  BY c.cluster_id, feed_id 
ORDER  BY num_docs DESC;

If c.cluster_id, feed_id are the primary key columns of all (both in this case) tables that you include columns from in the SELECT list, then this just works with PostgreSQL 9.1 or later.

Else you need to GROUP BY the rest of the columns or aggregate or provide more information.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • I think I need the GROUP BY as you mentioned as I'm getting the following error with your second answer: ERROR: column "c.num_docs" must appear in the GROUP BY clause or be used in an aggregate function – WildBill Apr 21 '12 at 21:25
  • Your first answer gives the following error: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 3: SELECT DISTINCT ON (c.cluster_id, feed_id) – WildBill Apr 21 '12 at 21:26
  • @WildBill: You may have missed the update on the first query. I fixed a mistake in my first version. As to the second query: If you provided the missing information which columns belong to which table and which primary keys and your version of PostgreSQL, my answer could be more specific. – Erwin Brandstetter Apr 21 '12 at 22:21
  • Thanks for your help and references! – WildBill Apr 21 '12 at 22:26
  • This was insanely slow for a table with just 28M rows in it... Is there a faster solution? After a join there's a lot of duplicates on the variation in column values. We want to order by these columns and return distinct rows based on the primary key (id). Doesn't seem to be a performance acceptable way of doing it... – Kevin Parker Dec 11 '19 at 05:43