16

I have a data set of email addresses and dates that those email addresses were added to a table. There can be multiple entries of an email address for various different dates. For example, if I have the data set below. I would be looking to get the date and count of distinct emails that we have between said date and 3 days ago.

Date   | email  
-------+----------------
1/1/12 | test@test.com
1/1/12 | test1@test.com
1/1/12 | test2@test.com
1/2/12 | test1@test.com
1/2/12 | test2@test.com
1/3/12 | test@test.com
1/4/12 | test@test.com
1/5/12 | test@test.com
1/5/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test1@test.com

Result set would look something like this if we use a date period of 3

date   | count(distinct email)
-------+------
1/1/12 | 3
1/2/12 | 3
1/3/12 | 3
1/4/12 | 3
1/5/12 | 1
1/6/12 | 2

I can get a distinct count of a date range using the query below, but looking to get a count of a range by day so I do not have to manually update the range for hundreds of dates.

select test.date, count(distinct test.email)  
from test_table as test  
where test.date between '2012-01-01' and '2012-05-08'  
group by test.date;
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
harold
  • 161
  • 1
  • 1
  • 3

5 Answers5

16

Test case:

CREATE TABLE tbl (date date, email text);
INSERT INTO tbl VALUES
  ('2012-01-01', 'test@test.com')
, ('2012-01-01', 'test1@test.com')
, ('2012-01-01', 'test2@test.com')
, ('2012-01-02', 'test1@test.com')
, ('2012-01-02', 'test2@test.com')
, ('2012-01-03', 'test@test.com')
, ('2012-01-04', 'test@test.com')
, ('2012-01-05', 'test@test.com')
, ('2012-01-05', 'test@test.com')
, ('2012-01-06', 'test@test.com')
, ('2012-01-06', 'test@test.com')
, ('2012-01-06', 'test1@test.com`')
;

Query - returns only days where an entry exists in tbl:

SELECT date
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  date BETWEEN t.date - 2 AND t.date -- period of 3 days
      ) AS dist_emails
FROM   tbl t
WHERE  date BETWEEN '2012-01-01' AND '2012-01-06'  
GROUP  BY 1
ORDER  BY 1;

Or - return all days in the specified range, even if there are no rows for the day:

SELECT date
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  date BETWEEN g.date - 2 AND g.date
      ) AS dist_emails
FROM  (SELECT generate_series(timestamp '2012-01-01'
                            , timestamp '2012-01-06'
                            , interval  '1 day')::date) AS g(date);

db<>fiddle here

Result:

day        | dist_emails
-----------+------------
2012-01-01 | 3
2012-01-02 | 3
2012-01-03 | 3
2012-01-04 | 3
2012-01-05 | 1
2012-01-06 | 2

This sounded like a job for window functions at first, but I did not find a way to define the suitable window frame. Also, per documentation:

Aggregate window functions, unlike normal aggregate functions, do not allow DISTINCT or ORDER BY to be used within the function argument list.

So I solved it with correlated subqueries instead. I guess that's the smartest way.

BTW, "between said date and 3 days ago" would be a period of 4 days. Your definition is contradictory there.

Slightly shorter, but slower for few days:

SELECT g.date, count(DISTINCT email) AS dist_emails
FROM  (SELECT generate_series(timestamp '2012-01-01'
                            , timestamp '2012-01-06'
                            , interval  '1 day')::date) AS g(date)
LEFT   JOIN tbl t ON t.date BETWEEN g.date - 2 AND g.date
GROUP  BY 1
ORDER  BY 1;

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
1

A lateral join is useful for such "sliding window" needs, like this:

SELECT
       t.day
     , ljl.dist_emails
FROM   tbl t
LEFT JOIN LATERAL (
        SELECT
               count(DISTINCT email) as dist_emails
        FROM   tbl
        WHERE  day BETWEEN t.day - 2 AND t.day -- period of 3 days
       ) AS ljl ON TRUE
WHERE t.day BETWEEN '2012-01-01' AND '2012-01-06' 

Note this is a variant to a previous query by Erwin Brandstetter, and it surprises me he hadn't suggested it, but these lateral joins excellent for this type of need.

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
0

in sql server :

`select test.date, count(distinct test.email) from test_table as test  where convert(date,test.date) between '2012-01-01' and '2012-05-08' group by test.date`

hope this helps.

Jonathan Hall
  • 75,165
  • 16
  • 143
  • 189
RoMEoMusTDiE
  • 4,739
  • 1
  • 17
  • 26
  • 1
    This is labelled PostgreSQL. No `convert()` function here. (Well, there is a `convert()` function, but it is for converting encodings, not data types like in SQL server.) Also, signatures are discouraged, your flair at the bottom right takes care of that. See [here](http://meta.stackexchange.com/questions/5029/are-taglines-signatures-disallowed) or [here](http://stackoverflow.com/faq#signatures). – Erwin Brandstetter May 11 '12 at 23:29
0

Instead of specifying the dates, you could always use a dateadd function:

test.date > dateadd(dd,-7,getdate())
Toon Krijthe
  • 52,876
  • 38
  • 145
  • 202
JMEls
  • 17
  • 4
0

An example for sliding window distinct count:

SELECT b.day, count(DISTINCT a.user_id)
from glip_production.presences_1d a,
 (SELECT distinct(day), TIMESTAMPADD(day,-6, day) dt_start
  from glip_production.presences_1d t1) b
where a.day >= b.dt_start and a.day <= b.day and b.day > '2017-11-01'
group by b.day