Ensuring no dupe ids in query return

Question

So for the following schema:

CREATE TABLE activity (
    id integer NOT NULL,
    start_date date NOT NULL
);

CREATE TABLE account (
    id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE contact (
    id integer NOT NULL,
    account_id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE activity_contact (
    id integer NOT NULL,
    contact_id integer NOT NULL,
    activity_id integer NOT NULL
);

insert into activity(id, start_date)
values
(1, '2021-11-03'),
(2, '2021-10-03'),
(3, '2021-11-02');

insert into account(id, name)
values
(1, 'Test Account');

insert into contact(id, account_id, name)
values
(1, 1, 'John'),
(2, 1, 'Kevin');

insert into activity_contact(id, contact_id, activity_id)
values
(1, 1, 1),
(2, 2, 1),
(3, 2, 2),
(4, 1, 3);

You can see that there are 3 activities and each contact has two. What i am searching for is the number of activities per account in the previous two months. So I have the following query

SELECT contact.account_id AS accountid,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;

This returns:

accountid   last_month  prev_month
    1           3           1

However this is incorrect. There are only 3 activities, its just that each contact sees activity 1. so it is counting that activity twice. Is there a way for me to only count each activity id one time so there is no duplication?

replace `count(*)` by `count(DISTINCT activity_id)` – Edouard Dec 09 '21 at 20:29 — Edouard, Dec 09 '21 at 20:29
Fantastic stuff chief. Thanks – discodowney Dec 09 '21 at 20:30 — discodowney, Dec 09 '21 at 20:30

Erwin Brandstetter · Accepted Answer · 2021-12-13T16:46:41.627

count(DISTINCT activity_id) to fold duplicates in the count, like Edouard suggested.
But there is more:

SELECT con.account_id AS accountid
     , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
                                               AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP)) AS last_month
     , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                                               AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
                           AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                           AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
RIGHT  JOIN contact     con ON con.id = aco.contact_id
-- JOIN   account       acc ON con.account_id = acc.id  -- noise
GROUP  BY 1;

db<>fiddle here

Most importantly, add an outer WHERE clause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table.
We have to move that predicate to the JOIN clause, lest we'd exclude accounts with no activity. (LEFT JOIN and RIGHT JOIN can both be used, mirroring each other.) See:
- Postgres Left Join with where condition
- Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail
Make that filter "sargable", so it can use an index on (start_date) (unlike your original formulation). Again, big impact for a small selection from a big table.
Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
Unlike other aggregate functions, count() returns 0 (not NULL) for "no rows", so we don't have to do anything extra.
Assuming referential integrity (enforced with a FK constraint), the join to table account is just expensive noise. Drop it. CURRENT_DATE is not wrong. But since your expressions yield timestamp anyway, it's bit more efficient to use LOCALTIMESTAMP to begin with.

Compare with your original to see that this is quite a bit faster.

And I assume you are aware that this query introduces a dependency on the TimeZone setting of the executing session. The current date depends on where in the world we ask. See:

Ignoring time zones altogether in Rails and PostgreSQL

If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:

SELECT con.account_id AS accountid
     , date_trunc('month', act.start_date) AS mon
     , count(DISTINCT aco.activity_id) AS dist_count
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
                           AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                           AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
RIGHT  JOIN  contact    con ON con.id = aco.contact_id
GROUP  BY 1, 2
ORDER  BY 1, 2 DESC;

Again, we can include accounts without activity. But months without activity do not show up ...

thanks. yeah i noticed the timezone issue yesterday. thats on the list of things to fix. Ill have a look at that link. Cheers — discodowney, Dec 09 '21 at 21:49
@discodowney: Either work with `timestamptz` or define the time zone *explicitly* (and hence the resulting date) for the query to get rid of the dependency (and possible corner case errors). Added an alt query while being at it. — Erwin Brandstetter, Dec 09 '21 at 21:57
thats interesting. the format is done to make the go server parsing as easy as possible but its good to see this. ill have a think about whats best. Thanks for the help — discodowney, Dec 09 '21 at 22:00
one further question. If i added a second account but added no activities, but i wanted it to show in the result with a count of 0 in both. Could i do that? im not seeing how i would manage it, im thinking id need a separate query for that calculation — discodowney, Dec 10 '21 at 17:09
@discodowney: All doable in a single query with `LEFT JOIN`. You add an account with no activities. With or without entries in table `contact`? — Erwin Brandstetter, Dec 10 '21 at 21:54
@discodowney: I update the answer to include accounts with no activity, as it fits in easily. Next time, please just ask another *question*. You can always link to earlier questions for context. — Erwin Brandstetter, Dec 13 '21 at 16:39

Ensuring no dupe ids in query return

1 Answers1