1

Good evening dear Stack overflow community. This is my first question here.

I have the following issue: I need a query to count distinct values in a rolling date range (3 days) for each country.

I have done some research and found the following discussion here [1]: Query for count of distinct values in a rolling date range.

For the problem I face I need grouping not only by date, but also by country. Please consider the following input table:

Date Country Email
1/1/12 DE de1@example.com
1/1/12 FRA fra1@example.com
1/1/12 SPA spa1@example.com
1/2/12 DE de1@example.com
1/2/12 DE de2@example.com
1/3/12 SPA spa1@example.com
1/3/12 SPA spa2@example.com
1/3/12 FRA fra2@example.com
1/4/12 SPA spa1@example.com
1/4/12 FRA fra2@example.com
1/4/12 FRA fra3@example.com
1/4/12 SPA spa3@example.com

The expect outcome with counted distinct emails will be the following:

Date Country Email
1/1/12 DE 1
1/1/12 FRA 1
1/1/12 SPA 1
1/2/12 DE 2
1/2/12 FRA 1
1/2/12 SPA 1
1/3/12 SPA 2
1/3/12 DE 2
1/3/12 FRA 2
1/4/12 SPA 3
1/4/12 FRA 2
1/4/12 DE 2

I was trying to modify the solution suggested in the above-mentioned discussion and modify the following by adding country in selected columns and in grouping.

SELECT date
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  date BETWEEN g.date - 2 AND g.date
      ) AS dist_emails
FROM  (SELECT generate_series(timestamp '2012-01-01'
                            , timestamp '2012-01-06'
                            , interval  '1 day')::date) AS g(date)

Unfortunately the updated query does not work, as country is not recognised and it errors.

SELECT date, country,
         ,(SELECT count(DISTINCT email)
           FROM   tbl
           WHERE  date BETWEEN g.date - 2 AND g.date
          ) AS dist_emails
    FROM  (SELECT generate_series(timestamp '2012-01-01'
                                , timestamp '2012-01-06'
                                , interval  '1 day')::date) AS g(date)
GROUP BY 1,2

Will be very grateful for your advice and sharing your expertise on the ways this can be fixed.

astentx
  • 6,393
  • 2
  • 16
  • 25
Alex
  • 13
  • 6
  • @GordonLinoff thanks and sorry for inconvenience – Alex Sep 14 '21 at 21:27
  • Please post the exact error message you're getting with your SQL – melcher Sep 14 '21 at 22:10
  • SQL compilation error: error line1 at position 17 invalid identifier 'country'. The issue happens in Snowflake. Unfortunately I can not edit the question and add the relevant tag. Admins could you please help with this? Many thanks in advance. – Alex Sep 14 '21 at 22:17

1 Answers1

0

The reason that country doesn't exist is that we can only SELECT columns from the tables that are in a FROM section of a query. The nested subquery selects from tbl but that is not available to the main, outer query, which only selects from the generated table g. g only has a date column, so that's the only column the outer query can select on directly.

Another issue with the query is that the COUNT is not factoring in countries.

For this I'd use basic join to query every row for every date range, then do a count distinct for every date + country. You can use an INNER join to remove days that have no entries or LEFT OUTER to return rows of {date}, nil,nil if there are no entries for that date range. Something like:

SELECT g.date
     , tbl.country
     , COUNT(DISTINCT(tbl.email))
    FROM  (SELECT generate_series(timestamp '2012-01-01'
                                , timestamp '2012-01-06'
                                , interval  '1 day')::date) AS g(date)
    INNER JOIN tbl ON (tbl.date BETWEEN g.date - 2 AND g.date)

GROUP BY 1,2
melcher
  • 1,543
  • 9
  • 15