12

Suppose I have a given time range. For explanation, let's consider something simple, like whole year 2018. I want to query data from ClickHouse as a sum aggregation for each quarter so the result should be 4 rows.

The problem is that I have data for only two quarters so when using GROUP BY quarter, only two rows are returned.

SELECT
     toStartOfQuarter(created_at) AS time,
     sum(metric) metric
 FROM mytable
 WHERE
     created_at >= toDate(1514761200) AND created_at >= toDateTime(1514761200)
    AND
     created_at <= toDate(1546210800) AND created_at <= toDateTime(1546210800)
 GROUP BY time
 ORDER BY time

15147612002018-01-01
15462108002018-12-31

This returns:

time       metric
2018-01-01 345
2018-04-01 123

And I need:

time       metric
2018-01-01 345
2018-04-01 123
2018-07-01 0
2018-10-01 0

This is simplified example but in real use case the aggregation would be eg. 5 minutes instead of quarters and GROUP BY would have at least one more attribute like GROUP BY attribute1, time so desired result is

time        metric  attribute1
2018-01-01  345     1
2018-01-01  345     2
2018-04-01  123     1
2018-04-01  123     2
2018-07-01  0       1
2018-07-01  0       2
2018-10-01  0       1
2018-10-01  0       2

Is there a way to somehow fill the whole given interval? Like InfluxDB has fill argument for group or TimescaleDb's time_bucket() function with generate_series() I tried to search ClickHouse documentation and github issues and it seems this is not implemented yet so the question perhaps is whether there's any workaround.

simPod
  • 11,498
  • 17
  • 86
  • 139
  • 1
    'WITH FILL' can be used since Clickhouse 19.14. Unfortunately, this seems to work properly only in simple case without additional group by some other attribute. – VitalyZ May 24 '20 at 21:16

4 Answers4

8

From ClickHouse 19.14 you can use the WITH FILL clause. It can fill quarters in this way:

WITH
    (
        SELECT toRelativeQuarterNum(toDate('1970-01-01'))
    ) AS init
SELECT
    -- build the date from the relative quarter number
    toDate('1970-01-01') + toIntervalQuarter(q - init) AS time,
    metric
FROM
(
    SELECT
        toRelativeQuarterNum(created_at) AS q,
        sum(rand()) AS metric
    FROM
    (
        -- generate some dates and metrics values with gaps
        SELECT toDate(arrayJoin(range(1514761200, 1546210800, ((60 * 60) * 24) * 180))) AS created_at
    )
    GROUP BY q
    ORDER BY q ASC WITH FILL FROM toRelativeQuarterNum(toDate(1514761200)) TO toRelativeQuarterNum(toDate(1546210800)) STEP 1
)

┌───────time─┬─────metric─┐
│ 2018-01-01 │ 2950782089 │
│ 2018-04-01 │ 2972073797 │
│ 2018-07-01 │          0 │
│ 2018-10-01 │  179581958 │
└────────────┴────────────┘
rubik
  • 8,814
  • 9
  • 58
  • 88
alrocar
  • 178
  • 1
  • 6
  • Incase it isn't clear, the range() input values are epoch time. Use https://www.epochconverter.com/ for easy conversion. – Hengjie Apr 20 '21 at 06:17
5

You can generate zero values using the "number" function. Then join your query and zero values using UNION ALL and already according to the obtained data we make a GROUP BY.

So, your query will look like:

SELECT SUM(metric),
       time
  FROM (
        SELECT toStartOfQuarter(toDate(1514761200+number*30*24*3600))  time,
               toUInt16(0) AS metric
          FROM numbers(30)

     UNION ALL 

          SELECT toStartOfQuarter(created_at) AS time,
               metric
          FROM mytable
         WHERE created_at >= toDate(1514761200)
           AND created_at >= toDateTime(1514761200)
           AND created_at <= toDate(1546210800)
           AND created_at <= toDateTime(1546210800)
       )
 GROUP BY time
 ORDER BY time

note toUInt16(0) - zero values must be of the same type as metrics

mikhail
  • 482
  • 1
  • 3
  • 14
  • Didn't know about `number` function, can be confortably used for generating series, thanks! But any way how to do that for unknown number of attributes? – simPod May 24 '18 at 16:21
  • In your case, you can count the number of quarters that are in your period, this is what you need a number for – mikhail May 28 '18 at 06:48
  • Yup, that can be done dynamically by deriving it from timerange but I meant the `attribute1` mentioned in Q. Anyway, meanwhile I kinda managed to make it work by using CROSS JOIN by joining subquery giving distinct `attribute1` values . Only thing that is missing now is support for CTE in postgres. Thanks! – simPod May 28 '18 at 09:01
2

As an alternative for numbers() function in some cases range and array functions can be useful.

Example: for each pair of (id1,id2) dates from the previous 7 days should be generated.

SELECT
  id1,
  id2,
  arrayJoin(
    arrayMap( x -> today() - 7 + x, range(7) )
  ) as date2
FROM table
WHERE date >= now() - 7
GROUP BY id1, id2

The result of that select can be used in UNION ALL to fill the 'holes' in data.

SELECT id1, id2, date, sum(column1)
FROM (
  SELECT
    id1,
    id2,
    date,
    column1 
  FROM table
  WHERE date >= now() - 7

  UNION ALL 

  SELECT
    id1,
    id2,
    arrayJoin(
      arrayMap( x -> today() - 7 + x, range(7) )
    ) as date2,
    0 as column1
  FROM table
  WHERE date >= now() - 7
  GROUP BY id1, id2
)
GROUP BY id1, id2, date
ORDER BY date, id1, id2
filimonov
  • 1,666
  • 1
  • 9
  • 20
2

Here is how I did it for hour buckets (needed to visualize this in Grafana) thanks to @filimonov and @mikhail

SELECT t, SUM(metric) as metric FROM (
    SELECT 
        arrayJoin(
          arrayMap( x -> toStartOfHour(addHours(toDateTime($from),x)),
              range(toUInt64(
                  dateDiff('hour', 
                      toDateTime($from), 
                      toDateTime($to)) + 1)))
        ) as t,
        0 as metric

    UNION ALL

    SELECT
        toStartOfHour(my_date) as t,
        COUNT(metric)
        FROM my_table
        WHERE t BETWEEN toDateTime($from) AND toDateTime($to)
        GROUP BY t
)
GROUP BY t ORDER BY t

So, for example for range from 2019-01-01 to 2019-01-02 it will give you:

SELECT t, SUM(metric) as metric FROM (
    SELECT 
        arrayJoin(
          arrayMap( x -> toStartOfHour(addHours(toDateTime('2019-01-01 00:00:00'),x)),
              range(toUInt64(
                  dateDiff('hour', 
                      toDateTime('2019-01-01 00:00:00'), 
                      toDateTime('2019-01-02 00:00:00')) + 1)))
        ) as t,
        0 as metric

    UNION ALL

    SELECT
        toStartOfHour(my_date) as t,
        COUNT(1) as metric
        FROM my_table
        WHERE t BETWEEN toDateTime('2019-01-01 00:00:00') AND toDateTime('2019-01-02 00:00:00')
        GROUP BY t
)
GROUP BY t ORDER BY t;
t                  |metric|
-------------------|------|
2019-01-01 00:00:00|     0|
2019-01-01 01:00:00|     0|
2019-01-01 02:00:00|     0|
2019-01-01 03:00:00|     0|
2019-01-01 04:00:00|     0|
2019-01-01 05:00:00|     0|
2019-01-01 06:00:00|     0|
2019-01-01 07:00:00|105702|
2019-01-01 08:00:00|113315|
2019-01-01 09:00:00|149837|
2019-01-01 10:00:00|185314|
2019-01-01 11:00:00|246106|
2019-01-01 12:00:00|323036|
2019-01-01 13:00:00|     0|
2019-01-01 14:00:00|409160|
2019-01-01 15:00:00|379113|
2019-01-01 16:00:00|256634|
2019-01-01 17:00:00|286601|
2019-01-01 18:00:00|280039|
2019-01-01 19:00:00|248504|
2019-01-01 20:00:00|218642|
2019-01-01 21:00:00|186152|
2019-01-01 22:00:00|148478|
2019-01-01 23:00:00|109721|
2019-01-02 00:00:00|     0|
VitalyZ
  • 445
  • 1
  • 4
  • 12