3

I have a table that looks something like the following - essentially containing a timestamp as well as some other columns:

WeatherTable
+---------------------+---------+----------------+      +
| TS                  | MonthET | InsideHumidity | .... |
+---------------------+---------+----------------+      |
| 2014-10-27 14:24:22 |       0 |             54 |      |
| 2014-10-27 14:24:24 |       0 |             54 |      |
| 2014-10-27 14:24:26 |       0 |             52 |      |
| 2014-10-27 14:24:28 |       0 |             54 |      |
| 2014-10-27 14:24:30 |       0 |             53 |      |
| 2014-10-27 14:24:32 |       0 |             55 |      |
| 2014-10-27 14:24:34 |       9 |             54 |      |
.......

I'm trying to formulate a SQL query that returns all rows within a certain timeframe (no problem here) with a certain arbitrary granularity, for instance, every 15 seconds. The number is always specified in seconds but is not limited to values less than 60. To complicate things further, the timestamps don't necessarily fall on the granularity required, so it's not a case of simply selecting the timestamp of 14:24:00, 14:24:15, 14:24:30, etc. - the row with the closest timestamp to each value needs to be included in the result.

For example, if the starting time was given as 14:24:30, the end time as 14:32:00, and the granularity was 130, the ideal times would be:

14:24:30
14:26:40
14:28:50
14:31:00

However, timestamps may not exist for each of those times, in which case the row with the closest timestamp to each of those ideal timestamps should be selected. In the case of two timestamps which are equally far away from the ideal timestamp, the earlier one should be selected.

The database is part of a web service, so presently I'm just ignoring the granularity in the SQL query and filtering the unwanted results out in (Java) code later. However, this seems far from ideal in terms of memory consumption and performance.

Any ideas?

Michael Berry
  • 70,193
  • 21
  • 157
  • 216
  • first you need to establish the rule to show absent values. For instance, suppose you need a value for InsideHumidity on 2014-10-27 14:24:29. You will said it is 54, 53 or 53.5? – Horaciux Nov 02 '14 at 23:27
  • Is the arbitrary granularity always in seconds which are <=60? And, if so, if we have 43, then does 43, 83, 126 all count? Or, just every 43? –  Nov 02 '14 at 23:31
  • That rule could be, last known value, weighted average, etc. Also, you need to know why you don't have a value for this timestam. Was it a difference in sampling frequency, value within a dead-band, comunication error, field equipment failure, etc. – Horaciux Nov 02 '14 at 23:35
  • @JeremyMiller Always specified in seconds, but can be greater than 60. Whatever "counts" is whatever is closest to the ideal values - I've updated the question to (hopefully) make this clearer. – Michael Berry Nov 02 '14 at 23:40
  • @Horaciux Hopefully the update has made it clearer, but the rule is the closest timestamp to the ideal timestamp (biased towards the earlier value in case of an equal offset.) The data is from a weather station which *usually* sends back packets every two seconds, but in certain cases (communication error, power failure, etc.) will skip some values. – Michael Berry Nov 02 '14 at 23:41
  • 1
    Thanks. I think I'm getting close to a mathematical solution. Will update soon. –  Nov 03 '14 at 00:19
  • Sorry, but I'll have to pick up on this later (hopefully not before it gets closed by a mod like my last question that asked for a better solution). The idea is to "bin" the times using the modulous operator and then sort by the bin, then time to get your results. –  Nov 03 '14 at 01:10

2 Answers2

3

You could try to do it like this:

Create a list of time_intervals first. Using the stored procedure make_intervals from Get a list of dates between two dates create a temporary tables calling it somehow like that:

call make_intervals(@startdate,@enddate,15,'SECOND');

You will then have a table time_intervals with one of two columns named interval_start. Use this to find the closest Timestamp to each interval somehow like that:

CREATE TEMPORARY TABLE IF NOT EXISTS time_intervals_copy
  AS (SELECT * FROM time_intervals);

SELECT
  time_intervals.interval_start,
  WeatherTable.*
FROM time_intervals
JOIN WeatherTable
  ON WeatherTable.TS BETWEEN @startdate AND @enddate
JOIN (SELECT
        time_intervals.interval_start AS interval_start,
        MIN(ABS(time_intervals.interval_start - WeatherTable.TS)) AS ts_diff
      FROM time_intervals_copy AS time_intervals
      JOIN WeatherTable
      WHERE WeatherTable.TS BETWEEN @startdate AND @enddate
      GROUP BY time_intervals.interval_start) AS min
  ON min.interval_start = time_intervals.interval_start AND
     ABS(time_intervals.interval_start - WeatherTable.TS) = min.ts_diff
GROUP BY time_intervals.interval_start;

This will find the closest timestamp to every time_interval. Note: Each row in WeatherTable could be listed more than once, if the interval used is less than half the interval of the stored data (or something like that, you get the point ;)).

Note: I did not test the queries, they are written from my head. Please adjust to your use-case and correct minor mistakes, that might be in there...

Community
  • 1
  • 1
wolfgangwalther
  • 1,226
  • 7
  • 15
1

For testing purposes, I extended your dataset to the following timestamps. The column in my database is called time_stamp.

2014-10-27 14:24:24
2014-10-27 14:24:26
2014-10-27 14:24:28
2014-10-27 14:24:32
2014-10-27 14:24:34
2014-10-27 14:24:25
2014-10-27 14:24:32
2014-10-27 14:24:34
2014-10-27 14:24:36
2014-10-27 14:24:37
2014-10-27 14:24:39
2014-10-27 14:24:44
2014-10-27 14:24:47
2014-10-27 14:24:53

I've summarized the idea, but let me explain in more detail before providing the solution I was able to work out.

The requirements are to address timestamps +/- a given time. Since we must go in either direction, we'll want to take the timeframe and split it in half. Then, -1/2 of the timeframe to +1/2 of the timeframe defines a "bin" to consider.

The bin for a given time from a given start time in an interval of @seconds is then given by this MySQL statement:

((floor(((t1.time_stamp - @time_start) - (@seconds/2))/@seconds) + 1) * @seconds)

NOTE: The whole + 1 trick is there so that we do not end up with bin of -1 index (it'll start at zero). All times are calculated from the start time to ensure timeframes of >=60 seconds work.

Within each bin, we will need to know the magnitude of the distance from the center of the bin for each timeframe. That's done by determining the number of seconds from start and subtracting it from the bin (then taking the absolute value).

At this stage we then have all times "binned up" and ordered within the bin.

To filter out these results, we LEFT JOIN to the same table and setup the conditions to remove the undesirable rows. When LEFT JOINed, the desirable rows will have a NULL match in the LEFT JOINed table.

I have rather hack-like replaced the start, end, and seconds with variables, but only for readability. MySQL-style comments are included in the LEFT JOIN ON clause identifying the conditions.

SET @seconds = 7;
SET @time_start = TIMESTAMP('2014-10-27 14:24:24');
SET @time_end = TIMESTAMP('2014-10-27 14:24:52');

SELECT t1.*
FROM temp t1
LEFT JOIN temp t2 ON
  #Condition 1: Only considering rows in the same "bin"
  ((floor(((t1.time_stamp - @time_start) - (@seconds/2))/@seconds) + 1) * @seconds)
 = ((floor(((t2.time_stamp - @time_start) - (@seconds/2))/@seconds) + 1) * @seconds)
AND
(
  #Condition 2 (Part A): "Filter" by removing rows which are greater from the center of the bin than others
  abs(
      (t1.time_stamp - @time_start)
      - (floor(((t1.time_stamp - @time_start) - (@seconds/2))/@seconds) + 1) * @seconds
  )
  > 
  abs(
      (t2.time_stamp - @time_start)
      - (floor(((t2.time_stamp - @time_start) - (@seconds/2))/@seconds) + 1) * @seconds
  )
  OR
  #Condition 2 (Part B1): "Filter" by removing rows which are the same distance from the center of the bin
  (
    abs(
        (t1.time_stamp - @time_start)
        - (floor(((t1.time_stamp - @time_start) - (@seconds/2))/@seconds) + 1) * @seconds
    )
    =
    abs(
        (t2.time_stamp - @time_start)
        - (floor(((t2.time_stamp - @time_start) - (@seconds/2))/@seconds) + 1) * @seconds
    )
    #Condition 2 (Part B2): And are in the future from the other match
    AND
      (t1.time_stamp - @time_start)
      >
      (t2.time_stamp - @time_start)
  )
)
WHERE t1.time_stamp - @time_start >= 0
AND @time_end - t1.time_stamp >= 0
#Condition 3: All rows which have a match are undesirable, so those 
#with a NULL for the primary key (in this case temp_id) are selected
AND t2.temp_id IS NULL

There may be a more succinct way to write the query, but it did filter the results down to what was needed with one notable exception -- I purposefully put in a duplicate entry. This query will return both such entries as they do meet the criteria as stated.

  • 1
    Be sure to add some kind of explanation, when you are done - would really like to understand the concept :) – wolfgangwalther Nov 03 '14 at 02:25
  • Thanks for trying to get this working - looks like an interesting approach! – Michael Berry Nov 03 '14 at 09:10
  • 1
    Getting closer. The query above now has the first result in each bin as the correct row to select. But, time for the workday to start, so will have to pick up later. –  Nov 03 '14 at 15:48