I am working on a report that will give a list of missing sequences using imported data:
CREATE TABLE `client_trans`
(
`id` INT NOT NULL AUTO_INCREMENT,
`client_id` INT NULL,
`sequence` INT NULL,
`other_data` INT NULL,
PRIMARY KEY (`id`),
INDEX `client_id_seq` (`client_id` ASC, `sequence` ASC)
);
Except for the id field there are no truly unique values or even combination of values
The data for this table looks as follows(ignoring other_data fields):
id client_id sequence
1 1000 1
2 1000 2
3 1000 2
4 1000 3
5 1001 1
6 1001 5
7 1001 6
8 1002 4
9 1002 6
As in the example above, it is possible for multiple client_id/sequence combinations and it is possible that a sequence does not start at 1 (nor 0)
While running a query to find missing sequences is possible, such as a variation on the answer to this question, this can take extremely long
An alternative to this method was to perform some insert/update queries before or while the data is being inserted into the table (using Pentaho data-integration tools) and using an additional table containing missing client_id/sequence values. This means that in the above example, when inserting (client_id, sequence) values (1001, 5), it would be picked up that sequences 2-4 are missing using something like the query I figured out below:
CREATE TABLE `missing_sequences` (
`client_id` int(11),
`miss_start` int(11),
`miss_end` int(11),
)
(Note that for easier testing the query in in SQL instead of Pentaho Execute SQL statements and the insert is commented out so that it is just a select)
SET @temp_id = 1001;
SET @temp_seq = 5;
/* Replace temp_id, temp_seq references with ? in Pentaho */
/* INSERT INTO missing_sequences (id, miss_start, miss_end) */
SELECT @temp_id id, max(t1.seq) + 1 missing_start, @temp_seq - 1 missing_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.id = @temp_id
AND t1.seq < @temp_seq
AND t2.id = @temp_id
AND t2.seq >= @temp_seq - 1
HAVING missing_end >= missing_start
Result:
id missing_start missing_end
1001 2 4
This will populate the missing sequence table somewhat successfully, but the problem comes in when a row gets added that contains one of the previously missing sequences.
(Initially I also had the primary index based on client_id and miss_start, which would also have taken care of duplicate values added, but wasn't entirely sure whether this would be correct)
One of four possibilities exist depending of the sequence number inserted, with examples:
@temp_seq = missing_start : (@temp_seq = 2)
update missing_start += 1
missing_start < @temp_seq < missing_end : (@temp_seq = 3)
split into two records
@temp_seq = missing_end : (@temp_seq = 4)
update missing_end -= 1
@temp_seq = missing_start = missing_end : (@temp_id = 1002, @temp_seq = 5)
delete record from missing_sequences table
This is now where my problem comes in (earlier if you take into account that imported data may not be sorted):
How do I cater for each possibility as well as the initial insert as well as duplicates in a Pentaho data integration transformation?
Edit: After some brainstorming, I came up with the following script that seems to be working when running it in MySQL, but not when running it as an 'Execute SQL statements' trigger. This is with a primary index on the missing_sequences table of (client_id, missing_start):
SET @orig_start = 0;
SET @orig_end = 0;
SET @temp_client_id = ?;
SET @temp_sequence = ?;
/* Find closest matching record and save start/end values*/
SELECT client_id, @orig_start:=miss_start miss_start, @orig_end:=miss_end miss_end
FROM missing_sequences
WHERE client_id = @temp_client_id
AND miss_start <= @temp_sequence
AND miss_end >= @temp_sequence
LIMIT 1; /* Just in case, delete all matches later anyway */
/* Delete the above record if exists */
DELETE FROM missing_sequences
WHERE client_id = @temp_client_id AND miss_start = @orig_start AND miss_end = @orig_end;
/* Insert new value. This will insert the FIRST value in the table
eg. if 1-10 is missing and 5 inserted, this will insert 1-4 as missing */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_start := max(t1.sequence) + 1 miss_start, @curr_end := @temp_sequence - 1 miss_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_client_id
AND t1.sequence < @temp_sequence
AND t2.client_id = @temp_client_id
AND t2.sequence >= @temp_sequence - 1
HAVING miss_end >= miss_start
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;
/* Insert upper missing value if it is different */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_end + 2 missing_start, @orig_end missing_end
FROM dual
WHERE @curr_end + 2 <= @orig_end
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;
Execute for each row and Variable substitution boxes are checked, but execution seems to be inconsistent or not updating the missing sequences table at all