1

I am trying to get the following query on Google Merchandise Store public dataset in BigQuery:

  1. Date
  2. Number of distinct users
  3. Running sum of the number of distinct users in the last 30 days

For eg (I used 3 days in the example for simplicity):

  date        distinct_users     distinct_users_3days
15/07/2018          8                    15
14/07/2018          2                    12
13/07/2018          5                    20
12/07/2018          5                    15
11/07/2018         10                    10
   ...

This is my current SQL code which gets the first two columns, but I can't figure out how to get the running sum:

SELECT
  date,
  COUNT(DISTINCT(fullVisitorId)) as daily_active_user
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_2017*`
WHERE
  _table_suffix BETWEEN "0101"
  AND "0715"
GROUP BY
  date

Any help is appreciated! :)

Cedric
  • 209
  • 4
  • 13
  • 2
    Both answers have quite conceptual issue! Hint - check difference between using ROWS. And RANGE in OVER. Let know if you will have a problem "solving" this so I or someone else will help further – Mikhail Berlyant Jul 18 '18 at 04:56
  • Hi @Mikhail, went to find out more about RANGE. From what I understand, "The RANGE clause limits the frame to contain rows that have its value within the specified range, relative to the current value.", does this mean we should use RANGE(..BETWEEN 29 PRECEDING AND CURRENT ROW) instead? source: https://sonra.io/2017/08/22/window-function-rows-and-range-on-redshift-and-bigquery/ – Cedric Jul 18 '18 at 06:25
  • you are close. but you still need to learn a little. hint - range operates on values not on rows. the importance of using range vs rows in your example - is if you have missing any day you will get wrong result, like 31 days instead of 30 days, and so on. – Mikhail Berlyant Jul 18 '18 at 06:30
  • these are few relevant examples for you - [Count unique ids in a rolling time frame](https://stackoverflow.com/questions/47658874/count-unique-ids-in-a-rolling-time-frame/47659590#47659590) and [aggregate QUERY - 3 month average per person for every month](https://stackoverflow.com/questions/47047859/aggregate-query-3-month-average-per-person-for-every-month/47048619#47048619) - feel free to vote them up if they will help :o) – Mikhail Berlyant Jul 18 '18 at 06:43

2 Answers2

2

Please try the following query for 3 days (SQL server 2014 )-:

SELECT date,COUNT(DISTINCT(fullVisitorId)) as daily_active_user,sum(COUNT(DISTINCT(fullVisitorId))) over (PARTITION BY null ORDER BY date desc ROWS 
BETWEEN CURRENT ROW AND 2 FOLLOWING) AS distinct_users_3days FROM YOUR_TABLE_NAME WHERE _table_suffix BETWEEN '0101' AND '715' GROUP BY date

For 30 days-:

SELECT 
date,COUNT(DISTINCT(fullVisitorId)) as daily_active_user,
sum(COUNT(DISTINCT(fullVisitorId))) over (PARTITION BY null ORDER BY date desc ROWS 
BETWEEN CURRENT ROW AND 29 FOLLOWING) AS distinct_users_3days 
FROM YOUR_TABLE_NAME 
WHERE _table_suffix 
BETWEEN '0101' AND '715' 
GROUP BY date
IShubh
  • 354
  • 1
  • 3
  • 12
  • Hi @IShubh, tried your query on BigQuery and it works perfectly! I've also posted my query as I figured out the answer to my own question and posted it here for future reference. Thanks for your help! :) – Cedric Jul 17 '18 at 14:08
2

I managed to figure out the answer to my question so I would like to share with the others who may encounter this problem in future.

The SQL code is:

SELECT
  date,
  COUNT(DISTINCT(fullVisitorId)) as daily_active_user,
  SUM(count(Distinct(fullVisitorId))) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS monthly_active_user
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_2017*`,
  unnest(hits) as h
WHERE
  _table_suffix BETWEEN "0101" AND "0715"
GROUP BY
  date

This gives a column which sums the distinct users in a 30 day window.

Cedric
  • 209
  • 4
  • 13