0

Suppose for simplicity that I have the following table in MySQL:

CREATE TABLE `events` (
    `pv_name` varchar(60) NOT NULL,
    `time_stamp` bigint(20) unsigned NOT NULL,
    `value` text,
    `value_valid` tinyint(1) NOT NULL,
    PRIMARY KEY (`pv_name`,`time_stamp`),
) ENGINE=InnoDB;

I am trying to find the most efficient query to implement the equivalent of the following:


Given a pair of time stamps t0 and t1:

For each pv_name:

  1. Get the value from the row with this pv_name and the largest time_stamp <= t0 (if one exists). This is the value of the process variable at the beginning of the time range. If this value is not valid then discard it.

  2. Get the set of values from the rows with this pv_name and a time_stamp in (t0, t1) that are valid (if any exist).

    If there is more than one distinct value among the combined set of values from 1 and 2 then return the pv_name.


In essence I am trying to find which process variables had a change in value in the given time range, including a change from the value that it had at the beginning of the time range.

There are on the order of billions of rows in the table and it will continue to grow. There are on the order of 100,000 distinct pv_names in the table and they will remain fairly static. The vast majority of adjacent values (ordered by time_stamp for each pv_name) are expected to be distinct.

EDIT

If I was going to implement this from scratch I would do the following: The set of pv_names would be stored in a trie. The value for each pv_name in the trie would be a link to a binary search tree. The binary search tree would store key, value pairs of (time_stamp, value). The value in each of these pairs would be the value of the pv_name at the corresponding time_stamp.

To find out which pv_names had a change in value for a given time_range (t0, t1) I would do the following: Iterate through each pv_name in the trie and follow the link to its binary search tree. Find the greatest time_stamp in this tree less than or equal to t0. If none exists, find the smallest time_stamp in this tree less than t1. If none of these exist go to the next pv_name in the trie. Otherwise, iterate through the time_stamps in increasing order comparing the value associated with the current time_stamp to the value associated with the previous. If they differ, print out the pv_name. Stop iterating through the time_stamps. Go to the next pv_name in the trie and repeat. If a time_stamp greater than or equal to t1 is arrived at, and no differences have been found, then go to the next pv_name in the trie and repeat. Do not use the value for the time_stamp t1 in the comparisons.

Simplified example:
pv_name | time_stamp | value
A       | 1.0        | 1.15
B       | 2.0        | 1.00
A       | 3.0        | 1.12
B       | 4.0        | 1.00
A       | 5.0        | 1.00
B       | 6.0        | 1.00
A       | 7.0        | 3.15
B       | 8.0        | 9.13
A       | 9.0        | 4.30
B       | 10.0       | 1.00
A       | 11.0       | 9.00
B       | 12.0       | 1.00

time range  | values of A      | values of B           | result
(0.0,0.5)   | NULL             | NULL                  | NULL
(1.5,2.0)   | 1.15             | NULL                  | NULL
(1.5,5.0)   | 1.15, 1.12       | NULL, 1.00, 1.00      | A
(4.0,9.0)   | 1.12, 1.00, 3.15 | 1.00, 1.00, 9.13      | A, B
(13.0,14.0) | 9.00             | 1.00                  | NULL

Can I do the equivalent with the same or better efficiency in MySQL or another database, relational or otherwise?

Community
  • 1
  • 1
Patrick
  • 147
  • 1
  • 15

1 Answers1

0
SELECT pv_name
FROM (
    -- Query for step 1
    SELECT e1.pv_name, e3.value
    FROM (SELECT pv_name, MAX(time_stamp) AS start_time
          FROM events
          WHERE time_stamp <= @t0
          GROUP BY pv_name) AS e1
    JOIN events AS e3 ON e3.pv_name = e1.pv_name AND e3.time_stamp = e1.start_time
    WHERE e3.value_valid
  UNION DISTINCT
    -- Query for step 2
    SELECT DISTINCT pv_name, value
    FROM events
    WHERE time_stamp BETWEEN @t0 AND t1
    AND value_valid
) AS x
GROUP BY pv_name
HAVING COUNT(*) > 1

The first subquery in the union uses one of the techniques in SQL Select only rows with Max Value on a Column. I think it should be pretty efficient because of your primary key, but you could try one of the other techniques there.

The second subquery gets all the valild rows in the t0 - t1 time range, removing duplicate values.

I don't know if it's more efficient to do the duplicate value suppression in the subquery and UNION DISTINCT, or defer it to the end with UNION ALL and HAVING COUNT(DISTINCT value) > 1. You'll need to benchmark the two methods.

Community
  • 1
  • 1
Barmar
  • 741,623
  • 53
  • 500
  • 612
  • I was afraid that was what you meant, it means the subquery will be a bit more complex. I don't think there's any way to get around the second problem in SQL -- there's no looping mechanism that you can break out of. as soon as you find a non-matching value. You could do it in a stored procedure that uses a cursor to step through the results, but that will probably be slow, too. – Barmar Mar 24 '16 at 02:54
  • I fixed the problem with omitting the event if the last row before `t0` is not valid. – Barmar Mar 24 '16 at 02:58
  • Will the joins be efficient if there are billions of rows or more in the `events` table? – Patrick Mar 24 '16 at 03:04
  • The total size of the table is not so important, what matters is the number of rows for each event in the `t0 - t1` time range. The primary key index should be used to find these rows efficiently. Take a look at the `EXPLAIN` output. – Barmar Mar 24 '16 at 03:06
  • This means that the efficiency will depend on how large a time range t0 to t1 is. Probably don't try doing it with a year range, but hours will probably be OK. – Barmar Mar 24 '16 at 03:08
  • I think one problem may be that if there are no rows for `e1` then `e1.start_time` will be NULL in the subsequent joins. Unfortunately I am hoping to find something that would handle a really large time range. – Patrick Mar 24 '16 at 03:16
  • If there is no row for a pv_name in `e1` we still want to search for changes for the pv_name in `(t0, t1)`. – Patrick Mar 24 '16 at 03:19
  • That would be true for an outer join. With inner joins, if there are no matching rows then the `pv_name` will be omitted entirely. – Barmar Mar 24 '16 at 03:19
  • Oh, so we need a union between the row found in step 1 and the rows in step 2. – Barmar Mar 24 '16 at 03:21
  • I think it would be ideal if there was a DISTINCT EXISTS GROUP BY aggregate function that didn't look for ALL the distinct values. Do you know if that is something that could be written as an addon, or if another database might support it? – Patrick Mar 24 '16 at 03:31
  • I wouldn't be surprised if other databases have something like that, MySQL is pretty meagre in this regard, but I don't know any specifics. As I mentioned above, you could do it in MySQL with a stored procedure that iterates through the rows and breaks out of the loop when it finds the first duplicate. You could also do it in a client language like PHP or Python. – Barmar Mar 24 '16 at 03:33
  • So I could do a group by and then use a cursor on the rows for each group without having to iterate through every row in each group? Is there any chance you could point me to an example of that? – Patrick Mar 26 '16 at 03:20
  • Don't really know anything off the top of my head. But if you google for "mysql stored procedure cursor" I'm sure you'll find something helpful. – Barmar Mar 27 '16 at 03:37