3

I'm trying to work out how to create a solution that will allow me to query a table that has a timestamp, and in return get a time series data. The request consists of start/end date & time, granularity type (minute, hour, day, week, month and year) and granularity value. Having tried to use in a query something like

GROUP BY ROUND(UNIX_TIMESTAMP(created_at) DIV 60)

to get the results per one minute, or DIV 300 for every five minutes is fine. The problem lies further up for calculating months and years' seconds which will be inaccurate. I've stumbled upon the generate_series in PGSQL (MySQL alternative) and am stuck trying to tie them together. How do I calculate a count of rows, for example, for two days, on a 15 minute granularity? It's a complex question that I'll probably have to break down further.

I have already visited #1 and #2, but they are incomplete. To me it seems that rounding will only be allowed to certain level and I'd have to restrict it (i.e .for 2 months period there cannot be hourly breakdown).

EDIT

It gave me the wrong impression - I would not have to calculate monthly figures based on seconds using the query like:

SELECT DATE_FORMAT(MIN(created_at),'%d/%m/%Y %H:%i:%s' as date,
COUNT(*) AS count FROM guests
GROUP BY ROUND(UNIX_TIMESTAMP(created_at) / 300)

It's only going to do grouping based on minimum value. But the question still stands - is the best approach really to go through the time period using granularity value and "slice" the data that way without loosing too much accuracy?

It seems that the only approach is to run sub-queries for a set of data (i.e. for a period of two months, generate 15 minute intervals timestamps, group the data into them and produce an aggregate) without dividing the original timestamp to produce the rounded approximation.

miken32
  • 42,008
  • 16
  • 111
  • 154
Daniel Protopopov
  • 6,778
  • 3
  • 23
  • 39
  • Can you please include some sample data in an sqlfiddle or something, then express your desired output. Context improves clarity/understanding. – mickmackusa Nov 16 '18 at 12:53
  • The data is stored through timestamps, it would be difficult to provide with proper example as there may be thousands of records. I'll edit the question with possible examples. – Daniel Protopopov Nov 16 '18 at 12:56
  • here is a basic structure of a MySQL number generator what you can use a subquery https://www.db-fiddle.com/f/mVJSpFHja4unD4fAsihADv/0 .. There are some out there but using a MySQL user variable is the most easy.. MySQL 8.0 is even more easy because it's supports recursive CTE's which can also be used to generate a number list... generating a date serie with it https://www.db-fiddle.com/f/mVJSpFHja4unD4fAsihADv/3 – Raymond Nijland Nov 16 '18 at 13:09
  • @RaymondNijland, the question is not how to generate the time series, but how to use it in conjunction with data that has timestamps to produce accurate time series data. – Daniel Protopopov Nov 16 '18 at 13:10
  • "the question is not how to generate the time series, but how to use it in conjunction with data that has timestamps to produce accurate time series data." Ok in that case read "Why should I provide an MCVE for what seems to me to be a very simple SQL query?" https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query – Raymond Nijland Nov 16 '18 at 13:12
  • I suppose the question was more for checking if there's another way of producing desired outcome without using sub-queries. You're right, @RaymondNijland, there's no need to produce MCVE there, it's pretty straightforward, this was an exploration of the topic. – Daniel Protopopov Nov 16 '18 at 13:18
  • Besides what is the used MySQL version? "I suppose the question was more for checking if there's another way of producing desired outcome without using sub-queries" subqeuries arn't evil things to avoid by the way some use them wrong so that MySQL loses performance.. Besides if you want to tie the table time slots with the generate series i assume you need to use SQL's `LEAD(column, 1)` to see the next generate time serie record.. You might need to simulate that it your MySQL version does not support it.. – Raymond Nijland Nov 16 '18 at 13:22
  • Having faced this problem a while ago, I wrote a couple of blogs about it. https://www.plumislandmedia.net/mysql/filling-missing-data-sequences-cardinal-integers/ and https://www.plumislandmedia.net/mysql/sql-reporting-time-intervals/ – O. Jones Nov 16 '18 at 13:25
  • Thank you @O.Jones, but I don't think I'll be able to apply your examples there with 5-10-15-30 minute intervals, will have to do a bit more thinking :) – Daniel Protopopov Nov 16 '18 at 13:34

1 Answers1

2

Let's say you have a gigantic table measure with two columns datestamp and temp.

Let's say you want to see the temperature every six minutes (10x per hour) for the last week. You can do this sort of thing. We'll get to defining trunc in a moment.

  SELECT trunc(datestamp) datestamp, AVG(temp) temp
    FROM measure
   WHERE datestamp >= CURDATE() - INVERVAL 7 DAY
  GROUP BY trunc(datestamp)
  ORDER BY trunc(datestamp)

That works for any reasonable definition of trunc. In this case trunc(t) returns the beginning of the six-minute period in which t occurs. So, trunc('1942-12-07 08:45:17') gives 1942-12-07 08:42:00).

Here's a query that works for every six minute interval.

  SELECT DATE_FORMAT(datestamp,'%Y-%m-%d %H:00') +
            INTERVAL (MINUTE(datestamp) -
                      MINUTE(datestamp) MOD 6) datestamp, 
         AVG(temp) temp
    FROM measure
   WHERE datestamp >= CURDATE() - INVERVAL 7 DAY
   GROUP BY DATE_FORMAT(datestamp,'%Y-%m-%d %H:00') +
            INTERVAL (MINUTE(datestamp) -
                      MINUTE(datestamp) MOD 6)
  ORDER BY 1

This uses inbuilt date arithmetic rather than unix timestamp arithmetic.

You can use a stored function to make this easier to read.

DELIMITER $$
DROP FUNCTION IF EXISTS TRUNC_N_MINUTES$$
CREATE
  FUNCTION TRUNC_N_MINUTES(datestamp DATETIME, n INT)
  RETURNS DATETIME DETERMINISTIC NO SQL
  COMMENT 'truncate to N minute boundary. For example,
           TRUNCATE_N_MINUTES(sometime, 15) gives the nearest
           preceding quarter hour'
  RETURN DATE_FORMAT(datestamp,'%Y-%m-%d %H:00') +
                INTERVAL (MINUTE(datestamp) -
                          MINUTE(datestamp) MOD n) MINUTE$$
DELIMITER ;

Then your query will say

  SELECT TRUNC_N_MINUTES(datestamp, 6) datestamp, AVG(temp) temp
    FROM measure
   WHERE datestamp >= CURDATE() - INVERVAL 7 DAY
  GROUP BY TRUNC_N_MINUTES(datestamp, 6)
  ORDER BY TRUNC_N_MINUTES(datestamp, 6)

If you want to summarize by 5, 10, 15, or minute boundaries (three items per hour) simply use that number in place of 6.

You'll need different trunc() functions for hours, etc.

The trunc() function for daily summaries is DATE(datestamp). For monthly summaries it is LAST_DAY(datestamp). For example,

  SELECT LAST_DAY(datestamp) month_ending, AVG(temp) temp
    FROM measure
  GROUP BY LAST_DAY(datestamp) 
  ORDER BY LAST_DAY(datestamp) 

yields a month-by-month summary.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • I’ll give it a try, however my main concern is due to inconsistent number of days (and lower timeframes) that’s done when doing any calculations with the time stamp. Because of that I tend to believe that records with time level granularity (and up to week) can be done with the time stamp, while date based ones (day and above) have to be done with grouping by those metrics and calculating aggregate of that. – Daniel Protopopov Nov 16 '18 at 19:08
  • It's certainly true that grouping by month has the flaw that months are different lengths. If your data are `TIMESTAMPS` and you are in a jurisdiction with daylight saving time, you'll get anomalous results on the changeover days. – O. Jones Nov 16 '18 at 20:52