2

So for the following schema:

CREATE TABLE activity (
    id integer NOT NULL,
    start_date date NOT NULL
);

CREATE TABLE account (
    id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE contact (
    id integer NOT NULL,
    account_id integer NOT NULL,
    name varchar NOT NULL
);

CREATE TABLE activity_contact (
    id integer NOT NULL,
    contact_id integer NOT NULL,
    activity_id integer NOT NULL
);

insert into activity(id, start_date)
values
(1, '2021-11-03'),
(2, '2021-10-03'),
(3, '2021-11-02');

insert into account(id, name)
values
(1, 'Test Account');

insert into contact(id, account_id, name)
values
(1, 1, 'John'),
(2, 1, 'Kevin');

insert into activity_contact(id, contact_id, activity_id)
values
(1, 1, 1),
(2, 2, 1),
(3, 2, 2),
(4, 1, 3);

You can see that there are 3 activities and each contact has two. What i am searching for is the number of activities per account in the previous two months. So I have the following query

SELECT contact.account_id AS accountid,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '1 mon'::interval)) AS last_month,
    count(*) FILTER (WHERE date_trunc('month'::text, activity.start_date) = date_trunc('month'::text, CURRENT_DATE - '2 mons'::interval)) AS prev_month
FROM activity
JOIN activity_contact ON activity_contact.activity_id = activity.id
JOIN contact ON contact.id = activity_contact.contact_id
JOIN account ON contact.account_id = account.id
GROUP BY contact.account_id;

This returns:

accountid   last_month  prev_month
    1           3           1

However this is incorrect. There are only 3 activities, its just that each contact sees activity 1. so it is counting that activity twice. Is there a way for me to only count each activity id one time so there is no duplication?

discodowney
  • 1,475
  • 6
  • 28
  • 58

1 Answers1

1

count(DISTINCT activity_id) to fold duplicates in the count, like Edouard suggested.
But there is more:

SELECT con.account_id AS accountid
     , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '1 mon')
                                               AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP)) AS last_month
     , count(DISTINCT aco.activity_id) FILTER (WHERE act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                                               AND   act.start_date <  date_trunc('month', LOCALTIMESTAMP - interval '1 mon')) AS prev_month
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
                           AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                           AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
RIGHT  JOIN contact     con ON con.id = aco.contact_id
-- JOIN   account       acc ON con.account_id = acc.id  -- noise
GROUP  BY 1;

db<>fiddle here

  • Most importantly, add an outer WHERE clause to the query to filter irrelevant rows early. This can make a big difference for a small selection from a big table.
    We have to move that predicate to the JOIN clause, lest we'd exclude accounts with no activity. (LEFT JOIN and RIGHT JOIN can both be used, mirroring each other.) See:

  • Make that filter "sargable", so it can use an index on (start_date) (unlike your original formulation). Again, big impact for a small selection from a big table.

  • Use the same expressions for your aggregate filter clauses. Lesser effect, but take it.
    Unlike other aggregate functions, count() returns 0 (not NULL) for "no rows", so we don't have to do anything extra.

  • Assuming referential integrity (enforced with a FK constraint), the join to table account is just expensive noise. Drop it. CURRENT_DATE is not wrong. But since your expressions yield timestamp anyway, it's bit more efficient to use LOCALTIMESTAMP to begin with.

Compare with your original to see that this is quite a bit faster.

And I assume you are aware that this query introduces a dependency on the TimeZone setting of the executing session. The current date depends on where in the world we ask. See:


If you are not bound to this particular output format, a pivoted form is simpler, now that we filter rows early:

SELECT con.account_id AS accountid
     , date_trunc('month', act.start_date) AS mon
     , count(DISTINCT aco.activity_id) AS dist_count
FROM   activity         act
JOIN   activity_contact aco ON aco.activity_id = act.id
                           AND act.start_date >= date_trunc('month', LOCALTIMESTAMP - interval '2 mon')
                           AND act.start_date <  date_trunc('month', LOCALTIMESTAMP)
RIGHT  JOIN  contact    con ON con.id = aco.contact_id
GROUP  BY 1, 2
ORDER  BY 1, 2 DESC;

Again, we can include accounts without activity. But months without activity do not show up ...

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • thanks. yeah i noticed the timezone issue yesterday. thats on the list of things to fix. Ill have a look at that link. Cheers – discodowney Dec 09 '21 at 21:49
  • 1
    @discodowney: Either work with `timestamptz` or define the time zone *explicitly* (and hence the resulting date) for the query to get rid of the dependency (and possible corner case errors). Added an alt query while being at it. – Erwin Brandstetter Dec 09 '21 at 21:57
  • thats interesting. the format is done to make the go server parsing as easy as possible but its good to see this. ill have a think about whats best. Thanks for the help – discodowney Dec 09 '21 at 22:00
  • one further question. If i added a second account but added no activities, but i wanted it to show in the result with a count of 0 in both. Could i do that? im not seeing how i would manage it, im thinking id need a separate query for that calculation – discodowney Dec 10 '21 at 17:09
  • @discodowney: All doable in a single query with `LEFT JOIN`. You add an account with no activities. With or without entries in table `contact`? – Erwin Brandstetter Dec 10 '21 at 21:54
  • The account will have contacts, yes – discodowney Dec 13 '21 at 15:13
  • 1
    @discodowney: I update the answer to include accounts with no activity, as it fits in easily. Next time, please just ask another *question*. You can always link to earlier questions for context. – Erwin Brandstetter Dec 13 '21 at 16:39