3

I am working on a report that will give a list of missing sequences using imported data:

 CREATE  TABLE `client_trans` 
 (
   `id` INT NOT NULL AUTO_INCREMENT,
   `client_id` INT NULL,
   `sequence` INT NULL,
   `other_data` INT NULL,
   PRIMARY KEY (`id`),
   INDEX `client_id_seq` (`client_id` ASC, `sequence` ASC) 
 );

Except for the id field there are no truly unique values or even combination of values

The data for this table looks as follows(ignoring other_data fields):

id  client_id sequence
1   1000      1
2   1000      2
3   1000      2
4   1000      3
5   1001      1
6   1001      5
7   1001      6
8   1002      4
9   1002      6

As in the example above, it is possible for multiple client_id/sequence combinations and it is possible that a sequence does not start at 1 (nor 0)

While running a query to find missing sequences is possible, such as a variation on the answer to this question, this can take extremely long

An alternative to this method was to perform some insert/update queries before or while the data is being inserted into the table (using Pentaho data-integration tools) and using an additional table containing missing client_id/sequence values. This means that in the above example, when inserting (client_id, sequence) values (1001, 5), it would be picked up that sequences 2-4 are missing using something like the query I figured out below:

CREATE TABLE `missing_sequences` (
  `client_id` int(11),
  `miss_start` int(11),
  `miss_end` int(11),
) 

(Note that for easier testing the query in in SQL instead of Pentaho Execute SQL statements and the insert is commented out so that it is just a select)

SET @temp_id = 1001;    
SET @temp_seq = 5;
/* Replace temp_id, temp_seq references with ? in Pentaho */
/* INSERT INTO missing_sequences (id, miss_start, miss_end) */
SELECT @temp_id id, max(t1.seq) + 1 missing_start, @temp_seq - 1 missing_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.id = @temp_id
  AND t1.seq < @temp_seq
  AND t2.id = @temp_id
  AND t2.seq >= @temp_seq - 1
HAVING missing_end >= missing_start

Result:

id       missing_start        missing_end
1001     2                    4

This will populate the missing sequence table somewhat successfully, but the problem comes in when a row gets added that contains one of the previously missing sequences.
(Initially I also had the primary index based on client_id and miss_start, which would also have taken care of duplicate values added, but wasn't entirely sure whether this would be correct)

One of four possibilities exist depending of the sequence number inserted, with examples:

@temp_seq = missing_start : (@temp_seq = 2) 
    update missing_start += 1
missing_start < @temp_seq < missing_end : (@temp_seq = 3)
    split into two records
@temp_seq = missing_end : (@temp_seq = 4)
    update missing_end -= 1
@temp_seq = missing_start = missing_end : (@temp_id = 1002, @temp_seq = 5)
    delete record from missing_sequences table

This is now where my problem comes in (earlier if you take into account that imported data may not be sorted):
How do I cater for each possibility as well as the initial insert as well as duplicates in a Pentaho data integration transformation?

Edit: After some brainstorming, I came up with the following script that seems to be working when running it in MySQL, but not when running it as an 'Execute SQL statements' trigger. This is with a primary index on the missing_sequences table of (client_id, missing_start):

SET @orig_start = 0;
SET @orig_end = 0;

SET @temp_client_id = ?;
SET @temp_sequence = ?;

/* Find closest matching record and save start/end values*/
SELECT client_id, @orig_start:=miss_start miss_start, @orig_end:=miss_end miss_end
FROM missing_sequences 
WHERE client_id = @temp_client_id
  AND miss_start <= @temp_sequence
  AND miss_end >= @temp_sequence
LIMIT 1; /* Just in case, delete all matches later anyway */

/* Delete the above record if exists */
DELETE FROM missing_sequences
WHERE client_id = @temp_client_id AND miss_start = @orig_start AND miss_end = @orig_end;

/* Insert new value. This will insert the FIRST value in the table
   eg. if 1-10 is missing and 5 inserted, this will insert 1-4 as missing */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_start := max(t1.sequence) + 1 miss_start, @curr_end := @temp_sequence - 1 miss_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_client_id
  AND t1.sequence < @temp_sequence
  AND t2.client_id = @temp_client_id
  AND t2.sequence >= @temp_sequence - 1
HAVING miss_end >= miss_start
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;

/* Insert upper missing value if it is different */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_end + 2 missing_start, @orig_end missing_end
FROM dual
WHERE @curr_end + 2 <= @orig_end
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;

Execute for each row and Variable substitution boxes are checked, but execution seems to be inconsistent or not updating the missing sequences table at all

Community
  • 1
  • 1
MSwart
  • 31
  • 1
  • 3
  • Have you thought about sorting the input stream and then passing through an `Analytical Query` step? This will allow you operate on values from the previous and/or next rows in the sequence. – Brian.D.Myers Aug 09 '13 at 00:13
  • Using sorted data and analytical query would work if the load would just be run once, but this load is supposed to be run on a daily basis with older data already loaded. (Note that no duplicate/updated values are expected, hence table output instead of insert/update) – MSwart Aug 13 '13 at 07:09
  • Pull your existing data and union with your incoming data, then sort and do the Analytical Query step. Switch your output to an Insert/Update and set it to not do any updates. Would that work? – Brian.D.Myers Aug 13 '13 at 16:36

0 Answers0