My problem boils down to the following. I have a table with some natural sequencing, and in it I have a key value which may repeat over time. I want to find the blocks where the key is the same, then changes, and then comes back to being the same. Example:
- A
- A
- B
- B
- B
- C
- C
- A
- A
- C
- C
Here I want the result to be
- A, 1-2
- B, 3-5
- C, 6-7
- A, 8-9
- C, 10-11
so I can't use that key value A, B, C to group by, because the same key can appear multiple times, I just want to squeeze out repetitive occurrences that are uninterrupted.
Needless to say, I want the simplest SQL one can come up with. It would use OLAP window functions.
I am usually pretty good with complicated SQL, but with sequences I am not so good. I will work on this a little bit myself, of course, and annex some ideas below this question in a subsequent edit.
Let's begin by defining the table for our discussion:
CREATE TABLE Seq (
num integer,
key char
);
UPDATE 1: doing some research I find a similar question here: How to find consecutive rows based on the value of a column? but both the question and the answers are wrapped up into a lot of extra stuff and confusing.
UPDATE 2: I already got one answer, thanks. Inspecting it now. Here is my test I am typing into PostgreSQL even as we speak:
CREATE TABLE Seq ( num int, key char );
INSERT INTO Seq VALUES
(1, 'A'), (2, 'A'),
(2, 'B'), (3, 'B'), (5, 'B'),
(6, 'C'), (7, 'C'),
(8, 'A'), (9, 'A'),
(10, 'C'), (11, 'C');
UPDATE 3: First contender of a solution is this
SELECT key, min(num), max(num)
FROM (
SELECT seq.*,
row_number() over (partition by key order by num) as seqnum
FROM Seq
) s
GROUP BY key, (num - seqnum)
ORDER BY min;
yields:
key | min | max
-----+-----+-----
A | 1 | 2
B | 2 | 3
B | 5 | 5
C | 6 | 7
A | 8 | 9
C | 10 | 11
(6 rows)
for some reason B repeats twice, I see why, I made a "mistake" in my test data, skipping sequence num 4 and going straight from 3 to 5.
This mistake is fortunate, because it allows me to point out that while in this example the sequence number is discrete, I am intending the sequence to arise from some continuous domain (e.g., time).
There is another "mistake" I made, in that I have num 2 repeated. Is that allowable? Probably not. So cleaning up the example, removing duplicate but leaving the gap:
DROP TABLE Seq;
CREATE TABLE Seq ( num int, key char );
INSERT INTO Seq VALUES
(1, 'A'), (2, 'A'),
(3, 'B'), (4, 'B'), (6, 'B'),
(7, 'C'), (8, 'C'),
(9, 'A'), (10, 'A'),
(11, 'C'), (12, 'C');
this still leaves us with the duplicate B block:
key | min | max
-----+-----+-----
A | 1 | 2
B | 3 | 4
B | 6 | 6
C | 7 | 8
A | 9 | 10
C | 11 | 12
(6 rows)
Now going with that first intuition by Gordon Linoff and trying to understand it and add to it:
SELECT s.*, num - seqnum AS diff
FROM (
SELECT seq.*,
row_number() over (partition by key order by num) as seqnum
FROM Seq
) s
ORDER BY num;
here is the num - seqnum trick before grouping:
num | key | seqnum | diff
-----+-----+--------+------
1 | A | 1 | 0
2 | A | 2 | 0
3 | B | 1 | 2
4 | B | 2 | 2
6 | B | 3 | 3
7 | C | 1 | 6
8 | C | 2 | 6
9 | A | 3 | 6
10 | A | 4 | 6
11 | C | 3 | 8
12 | C | 4 | 8
(11 rows)
I doubt that this is the answer quite yet.