-1

I got 2 tables; domains and events. Im trying to create a query that returns a list of distinct domains that is ordered by the oldest events (for that domain) with nulls first and distinct domain.

Basically this query will do the job:

SELECT * FROM domains left join events on events.attributes ->> 'domain' = domains.domain AND events.name = 'event1' WHERE parent is null ORDER BY domain, moment asc nulls first;

But the output is not distinct on 'domain'. When using 'distinct on', it gives the wrong output where the timestamp (moment) is not the latest one for that domain in the events table:

SELECT distinct on (domain) domain,moment FROM domains left join events on events.attributes ->> 'domain' = domains.domain AND events.name = 'event1' WHERE parent is null ORDER BY domain, moment asc nulls first;

Nor does this seem to work:

SELECT * FROM (SELECT DISTINCT on (domain) domain,moment,parent FROM "domains" left join events on events.attributes ->> 'domain' = domains.domain AND events.name = 'event1') AS domains ORDER BY moment asc nulls first;

The following code simulates the database and the query that i need to have distinct on without having distinct modify the order of the rows:

(Oh, and as you will see, the parent needs to be null so it only selects top domains. But thats a simple 'where'.)

create table domains (domain text, parent text);
        insert into domains (domain, parent) values ('stackoverflow.com', null);
        insert into domains (domain, parent) values ('test.stackoverflow.com', 'stackoverflow.com');
        insert into domains (domain, parent) values ('github.com', null);
        insert into domains (domain, parent) values ('example.com', null);
        insert into domains (domain, parent) values ('google.com', null);
        
create table events (name text, attributes jsonb, moment timestamp with time zone);
        insert into events (name, attributes, moment) values('event1', '{"domain": "example.com"}', '2011-01-01');
        insert into events (name, attributes, moment) values('event1', '{"domain": "github.com"}', '2012-01-01');
        insert into events (name, attributes, moment) values('event1', '{"domain": "stackoverflow.com"}', '2013-01-01');
        insert into events (name, attributes, moment) values('event1', '{"domain": "example.com"}', '2014-01-01');
        insert into events (name, attributes, moment) values('event1', '{"domain": "stackoverflow.com"}', '2015-01-01');

SELECT * FROM domains left join events on events.attributes ->> 'domain' = domains.domain AND events.name = 'event1' WHERE parent is null ORDER BY domain, moment asc nulls first;

How do I get this to work?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
003random
  • 1
  • 1
  • 1
    Why JSON use a normalised tables – nbk Oct 05 '21 at 22:42
  • there are different types of events. and the attributes differ. hence those are JSON – 003random Oct 05 '21 at 22:43
  • 1
    "oldest events (for that domain)" "timestamp (moment) is not the latest one" Well, do you want the oldest, or do you want the latest? Those are opposite things. – jjanes Oct 06 '21 at 00:50

2 Answers2

0

Have you considered using a row_number window function? Something like

SELECT *
FROM
(
SELECT *, row_number() OVER (PARTITION BY events.attributes->>'domain' ORDER BY moment ASC) rn FROM domains left join events on events.attributes ->> 'domain' = domains.domain AND events.name = 'event1' WHERE parent is null
) a
WHERE rn = 1
ORDER BY domain, moment asc nulls first

The inner query uses row_number() OVER (PARTITION BY events.attributes->>'domain' ORDER BY moment ASC) rn to create a field that, for each events.attributes->>'domain' grouping, numbers things in order of events.moment. The outer query simply limits to the first one for each grouping and does the final ordering.

EdmCoff
  • 3,506
  • 1
  • 9
  • 9
  • Yes! Very clever way of doing it. This exact query (but with DESC instead of ASC) seems to produce the output I was after. Thank you! – 003random Oct 05 '21 at 23:53
  • @003random: But it does not produce the requested sort order (regardless of how we interpret ambiguities in the question)? https://dbfiddle.uk/?rdbms=postgres_13&fiddle=f861938e238d086a501d4bb44b18461d – Erwin Brandstetter Oct 06 '21 at 01:44
0

with nulls first

That's ambiguous because there are two distinct sources of NULL for the moment in your queries while events.moment can be NULL - which we don't know as there's no table definition:

  • LEFT JOIN finds no qualifying row to the right.
  • LEFT JOIN finds a qualifying row to the right, but moment IS NULL.

Assuming events.moment is defined NOT NULL, domains.domain is UNIQUE, and you want to list domains without qualifying events first and the rest by oldest event first.

Select the oldest event per qualifying domain in a subquery. It's faster to aggregate before the join to begin with. See:

SELECT d.domain, e.moment
FROM   domains d
LEFT   JOIN (
   SELECT DISTINCT ON (1)
          attributes ->> 'domain' AS domain, moment
   FROM   events
   WHERE  name = 'event1'
   ORDER  BY 1, moment
   ) e USING (domain)
WHERE  d.parent IS NULL
ORDER  BY e.moment NULLS FIRST, d.domain;

I added domain to ORDER BY, to sort ties alphabetically.
Or, while we need nothing else from events, just:

SELECT d.domain, e.moment
FROM   domains d
LEFT   JOIN (
   SELECT attributes ->> 'domain' AS domain, min(moment) AS moment
   FROM   events
   WHERE  name = 'event1'
   GROUP  BY 1
   ) e USING (domain)
WHERE  d.parent IS NULL
ORDER  BY e.moment NULLS FIRST, d.domain;

db<>fiddle here

See:

If there are many domains with name = 'event1' in events that do not contribute to this query, then a LATERAL subquery is faster. See:

Depending on undisclosed data distribution and cardinalities, there may be (much) faster solutions. row_number() is never fastest for this. I have tried many times.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228