Suppose for simplicity that I have the following table in MySQL:
CREATE TABLE `events` (
`pv_name` varchar(60) NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`value` text,
`value_valid` tinyint(1) NOT NULL,
PRIMARY KEY (`pv_name`,`time_stamp`),
) ENGINE=InnoDB;
I am trying to find the most efficient query to implement the equivalent of the following:
Given a pair of time stamps t0
and t1
:
For each pv_name
:
Get the value from the row with this
pv_name
and the largesttime_stamp <= t0
(if one exists). This is the value of the process variable at the beginning of the time range. If this value is not valid then discard it.Get the set of values from the rows with this
pv_name
and atime_stamp
in(t0, t1)
that are valid (if any exist).If there is more than one distinct value among the combined set of values from 1 and 2 then return the pv_name.
In essence I am trying to find which process variables had a change in value in the given time range, including a change from the value that it had at the beginning of the time range.
There are on the order of billions of rows in the table and it will continue to grow. There are on the order of 100,000 distinct pv_names in the table and they will remain fairly static. The vast majority of adjacent values (ordered by time_stamp for each pv_name) are expected to be distinct.
EDIT
If I was going to implement this from scratch I would do the following: The set of pv_names would be stored in a trie. The value for each pv_name in the trie would be a link to a binary search tree. The binary search tree would store key, value pairs of (time_stamp, value). The value in each of these pairs would be the value of the pv_name at the corresponding time_stamp.
To find out which pv_names had a change in value for a given time_range (t0, t1) I would do the following: Iterate through each pv_name in the trie and follow the link to its binary search tree. Find the greatest time_stamp in this tree less than or equal to t0. If none exists, find the smallest time_stamp in this tree less than t1. If none of these exist go to the next pv_name in the trie. Otherwise, iterate through the time_stamps in increasing order comparing the value associated with the current time_stamp to the value associated with the previous. If they differ, print out the pv_name. Stop iterating through the time_stamps. Go to the next pv_name in the trie and repeat. If a time_stamp greater than or equal to t1 is arrived at, and no differences have been found, then go to the next pv_name in the trie and repeat. Do not use the value for the time_stamp t1 in the comparisons.
Simplified example:
pv_name | time_stamp | value
A | 1.0 | 1.15
B | 2.0 | 1.00
A | 3.0 | 1.12
B | 4.0 | 1.00
A | 5.0 | 1.00
B | 6.0 | 1.00
A | 7.0 | 3.15
B | 8.0 | 9.13
A | 9.0 | 4.30
B | 10.0 | 1.00
A | 11.0 | 9.00
B | 12.0 | 1.00
time range | values of A | values of B | result
(0.0,0.5) | NULL | NULL | NULL
(1.5,2.0) | 1.15 | NULL | NULL
(1.5,5.0) | 1.15, 1.12 | NULL, 1.00, 1.00 | A
(4.0,9.0) | 1.12, 1.00, 3.15 | 1.00, 1.00, 9.13 | A, B
(13.0,14.0) | 9.00 | 1.00 | NULL
Can I do the equivalent with the same or better efficiency in MySQL or another database, relational or otherwise?