0

I have an awkward situation where I have a table storing data about changes to a value over time.

I have a column groupId which groups together changes to a given value. I have value, which stores what the value changed to, and I have date which stores the date the change occurred.

e.g. if value a comes into being on 20000101 and changes to b on 20010101, we might have something like:

+---------+-------+----------+
| groupId | value |   date   |
+---------+-------+----------+
|      42 | a     | 20000101 |
|      42 | b     | 20010101 |
+---------+-------+----------+

Now to make things fun, we can have records which don't represent a material change in value e.g.

+---------+-------+----------+
| groupId | value |   date   |
+---------+-------+----------+
|      43 | a     | 20000101 |
|      43 | b     | 20010101 |
|      43 | b     | 20020101 |
+---------+-------+----------+

and for additional fun, we can have a value which changes to something else, then changes back to what it was before, like:

+---------+-------+----------+
| groupId | value |   date   |
+---------+-------+----------+
|      44 | a     | 20000101 |
|      44 | b     | 20010101 |
|      44 | a     | 20020101 |
+---------+-------+----------+

Combining these together, we can have a group that looks something like this:

+---------+-------+----------+
| groupId | value |   date   |
+---------+-------+----------+
|      45 | a     | 20000101 |
|      45 | a     | 20010101 |
|      45 | b     | 20020101 |
|      45 | b     | 20030101 |
|      45 | a     | 20040101 |
|      45 | a     | 20050101 |
|      45 | b     | 20060101 |
|      45 | b     | 20070101 |
+---------+-------+----------+

What I need to do is to write a query which will return rows for each group, but throwing away any of those non-material changes. For group 45 above, this would mean returning:

+---------+-------+----------+
| groupId | value |   date   |
+---------+-------+----------+
|      45 | a     | 20000101 |
|      45 | b     | 20020101 |
|      45 | a     | 20040101 |
|      45 | b     | 20060101 |
+---------+-------+----------+

i.e. we keep only the earliest date for each 'contiguous' group of (groupId,, value).

Is there any reasonable way to achieve this?

I'm doing this in MySQL although a solution which doesn't depend on that would be ideal.

rbennett485
  • 1,907
  • 15
  • 24

2 Answers2

0

using the method in this answer to simulate lag() in MySql:

SET @prev_value='';
select groupId, value, date
from (
  select groupId, @prev_value prev_value, @prev_value :=value value, date
  from t
  order by groupId, date
  ) a
where prev_value <> value;

rextester demo: http://rextester.com/PWF35736

returns:

+---------+-------+----------+
| groupId | value |   date   |
+---------+-------+----------+
|      45 | a     | 20000101 |
|      45 | b     | 20020101 |
|      45 | a     | 20040101 |
|      45 | b     | 20060101 |
+---------+-------+----------+
Community
  • 1
  • 1
SqlZim
  • 37,248
  • 6
  • 41
  • 59
0

First, we need to build information into the table itself that tells us when a change is immaterial. In this case, we know that a record is immaterial when two identical values appear next to each other in time. We can do this by assigning a "rank" that groups immaterial and material records together. Assuming our table is called A, the following query:

select    a1.groupID
        , a1.value
        , a1.date
        , COUNT(a2.groupID) as Ranked
    from A a1
    left join A a2
        on a2.groupID = a1.groupID
        and a2.value <> a1.value
        and a2.date < a1.date
    group by  a1.groupID
            , a1.value
            , a1.date
    order by  a1.date

produces this table:

+ ------- + ----- + ---------- + ------ +
| groupId | value | date       | Ranked |
+ ------- + ----- + ---------- + ------ +
| 45      | a     | 2000-01-01 | 0      |
| 45      | a     | 2001-01-01 | 0      |
| 45      | b     | 2002-01-01 | 2      |
| 45      | b     | 2003-01-01 | 2      |
| 45      | a     | 2004-01-01 | 2      |
| 45      | a     | 2005-01-01 | 2      |
| 45      | b     | 2006-01-01 | 4      |
| 45      | b     | 2007-01-01 | 4      |
+ ------- + ----- + ---------- + ------ +

Then by grouping on groupId, value, and ranked, we can select the min(date). Since MySQL does not support CTE's, we'll just use a temporary table

create temporary table Ranking as (
    select    a1.groupID
            , a1.value
            , a1.date
            , COUNT(a2.groupID) as Ranked
        from A a1
        left join A a2
            on a2.groupID = a1.groupID
            and a2.value <> a1.value
            and a2.date < a1.date
        group by  a1.groupID
                , a1.value
                , a1.date
        order by  a1.date
)

select    groupId
        , value
        , min(date) as date
    from Ranking
    group by  groupId
            , value
            , ranked
    order by date

and voila, we get the desired result

+ ------- + ----- + ---------- +
| groupId | value | date       |
+ ------- + ----- + ---------- +
| 45      | a     | 2000-01-01 |
| 45      | b     | 2002-01-01 |
| 45      | a     | 2004-01-01 |
| 45      | b     | 2006-01-01 |
+ ------- + ----- + ---------- +
KindaTechy
  • 1,041
  • 9
  • 25