0

Given I have multiple tables in BigQuery, hence I have multiple SQL-statements that gives me "the number of X per day". For example:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as installs
FROM database.table1
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | installs |
-------------------------
| 2017-01-01 | 11       |
| 2017-01-02 | 22       |
etc

Another statement:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as uninstalls
FROM database.table2
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | uninstalls |
---------------------------
| 2017-01-02 | 22         |
| 2017-01-03 | 33         |
etc

Another statement:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as cases
FROM database.table3
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | cases |
----------------------
| 2017-01-01 | 11    |
| 2017-01-03 | 33    |
etc

etc

Now I need to combine all these into a single SELECT statement that gives the following results:

| day        | installs | uninstalls | cases |
----------------------------------------------
| 2017-01-01 | 11       | 0          | 11    |
| 2017-01-02 | 22       | 22         | 0     |
| 2017-01-03 | 0        | 33         | 33    |
etc

Is this even possible?

Or what's the closest SQL-statement I can write that would give me a similar result?

Any feedback is appreciated!

corgrath
  • 11,673
  • 15
  • 68
  • 99
  • Yes thats entirely possible, simply join them together on `timestamp`. By the way, I hope in reality you have more functional tablenames, and it is not a good idea to name an attribute `timestamp` as that is a keyword. – HoneyBadger Sep 20 '17 at 13:10
  • Join by timestamp? The timestamp is actually a timestamp tho, down to ms, so I am not sure if joining on timestamp is the best idea? – corgrath Sep 20 '17 at 13:17
  • Well, just join it by `FORMAT_TIMESTAMP("%F",timestamp)` then – HoneyBadger Sep 20 '17 at 13:19
  • I am unsure how the JOIN syntax should be. – corgrath Sep 20 '17 at 13:28
  • 1
    This is no efficient way to do this at Big Query side. JOIN on Big Query table is very heavy operation and will cost you more. and even takes a lot time to do so. Better move it away in java layer or avoid to do so. – Jack Sep 27 '17 at 10:30

3 Answers3

2

Here is a self-contained example that might help to get you started. It uses two dummy tables, InstallEvents and UninstallEvents, which contain timestamps for the respective actions. It creates a common table expression called StartAndEnd that computes the minimum and maximum dates for these events in order to decide which dates to aggregate over, then unions the contents of the InstallEvents and UninstallEvents, counting the events for each day.

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
),
StartAndEnd AS (
  SELECT MIN(DATE(timestamp)) AS min_date, MAX(DATE(timestamp)) AS max_date
  FROM (
    SELECT * FROM InstallEvents UNION ALL
    SELECT * FROM UninstallEvents
  )
)
SELECT
  day,
  COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
  COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
  SELECT *, true AS is_install
  FROM InstallEvents UNION ALL
  SELECT *, false
  FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(
    (SELECT min_date FROM StartAndEnd),
    (SELECT max_date FROM StartAndEnd)
  )) AS day
GROUP BY day
ORDER BY day;

If you know what the start and end dates are in advance, you can hard-code them in the query instead and then omit the StartAndEnd CTE:

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT
  day,
  COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
  COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
  SELECT *, true AS is_install
  FROM InstallEvents UNION ALL
  SELECT *, false
  FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-04')) AS day
GROUP BY day
ORDER BY day;

To see the events in the sample data, use a query that unions the contents:

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT timestamp, true AS is_install
FROM InstallEvents UNION ALL
SELECT timestamp, false
FROM UninstallEvents;
Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99
1

Below is for BigQuery Standard SQL

#standardSQL
WITH calendar AS (
  SELECT day
  FROM (
    SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
    FROM (
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
    )
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT 
  c.day AS day, 
  IFNULL(SUM(installs), 0) AS installs,
  IFNULL(SUM(uninstalls), 0) AS uninstalls,
  IFNULL(SUM(cases),0) AS cases  
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs   FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases      FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
-- ORDER BY day  

Please note: you are using timestamp as a column name which is not the best practice as it is keyword, so in my example i leave your naming but consider to change this!

You can test / play this solution with below dummy data

#standardSQL
WITH `database.table1` AS (
  SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs  
  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 
),
`database.table2` AS (
  SELECT TIMESTAMP '2016-12-01' AS timestamp, 1 AS installs  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 
),
`database.table3` AS (
  SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
  SELECT TIMESTAMP '2017-01-10', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 
),
calendar AS (
  SELECT day
  FROM (
    SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
    FROM (
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
    )
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT 
  c.day AS day, 
  IFNULL(SUM(installs), 0) AS installs,
  IFNULL(SUM(uninstalls), 0) AS uninstalls,
  IFNULL(SUM(cases),0) AS cases  
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs   FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases      FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
ORDER BY day
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
0

I am not very familiar with bigquery, so this is probably not going to be a copy-paste answer.

You'll first have to build a calander table to make sure you have all dates. Here's an example for sql server. There are probably examples for bigquery available as well. The following assumes a Calander table with Date attribute in timestamp.

Once you have your calander table you can join all your tables to that:

SELECT      FORMAT_TIMESTAMP("%F",C.Date) AS day
,           COUNT(T1.DATE(T1.TIMESTAMP)) AS installs --Here you could also use your FORMAT_TIMESTAMP
,           COUNT(T1.DATE(T2.TIMESTAMP)) AS uninstalls
FROM        Calander C
LEFT JOIN   database.table1 T1
        ON  DATE(T1.TIMESTAMP) = DATE(C.Date) --Convert to date to remove times, you could also use your FORMAT_TIMESTAMP
LEFT JOIN   database.table2 T2
        ON  DATE(T2.TIMESTAMP) = DATE(C.Date)
GROUP BY    day
ORDER BY    day ASC
HoneyBadger
  • 14,750
  • 3
  • 34
  • 48
  • Thanks for your feedback and I wished it was so easy. However, as you can see in my question, each "sub-select" has a order by, etc. Here is a real example of a left join I have tried https://pastebin.com/B6usyvpf but it complains on that "day" is unknown. Good attempt though. – corgrath Sep 20 '17 at 13:55
  • The error about unknown day is because it doesn't know the alias yet. So you'll have to be fully explicit: `ON acq_adult_day = FORMAT_TIMESTAMP("%F",timestamp)` – HoneyBadger Sep 20 '17 at 14:34