2

Consider the following table:

id  || C1  || C2  || C3 || Timestamp
--------------------------------
1   || a   || b   || 1  || t1
2   || a   || b   || 1  || t2
3   || a   || b   || 2  || t3
4   || a   || b   || 2  || t4
5   || a   || b   || 2  || t5
6   || b   || d   || 3  || t6
7   || b   || d   || 3  || t7

C3 contains a distinct value for a given combination of C1 and C2. A given combination of C1, C2 and C3 can have multiple rows with differing timestamps (like rows 4 and 5).

However, duplication has led to the creation of more than one C3 for a given C1 and C2 pair. So here rows 3, 4 and 5 violated the rule, as a new C3 (2) was introduced. C3 should always be unique given C1 and C2.

I want to keep the latest (w.r.t. timestamp) combination of C1, C2 and C3 and delete all the older ones, so only rows 3, 4 and 5 should survive for the (a, b) pair and rows 1 and 2 should be removed.
By latest I mean that older C3 values for a given C1/C2 pair should be removed; in this case rows 1 and 2 are removed, but 3, 4 and 5 remain.

[screenshot of the actual table]

For example, given unique nid and original columns, all the rows with mock_id YR should be removed.

I have tried various queries and joins, but have been unable to come up with something that solves this exact problem.
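For anyone who wants to reproduce the problem, here is a minimal setup (my own harness, not part of the question) that loads the sample table into an in-memory SQLite database. The table name `strings` is borrowed from the asker's own answer below, and the timestamps t1–t7 are kept as plain strings, which happen to sort in the right order:

```python
import sqlite3

# Recreate the sample table from the question in an in-memory database.
# Table name "strings" is taken from the asker's own answer further down.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE strings (
        id INTEGER PRIMARY KEY,
        C1 TEXT, C2 TEXT, C3 INTEGER, Timestamp TEXT
    )
""")
conn.executemany(
    "INSERT INTO strings VALUES (?, ?, ?, ?, ?)",
    [(1, 'a', 'b', 1, 't1'), (2, 'a', 'b', 1, 't2'),
     (3, 'a', 'b', 2, 't3'), (4, 'a', 'b', 2, 't4'), (5, 'a', 'b', 2, 't5'),
     (6, 'b', 'd', 3, 't6'), (7, 'b', 'd', 3, 't7')])

count = conn.execute("SELECT COUNT(*) FROM strings").fetchone()[0]
print(count)  # 7
```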

arjunkhera
  • This question is asked and answered (often correctly) several times a day. Note that for large datasets, it can be quicker to construct a new table, retaining just the desired data, and then replace the old table with that. – Strawberry Dec 03 '19 at 07:45
  • 1
    In addition, the expected output you gave does not agree with the sentence which precedes it. – Tim Biegeleisen Dec 03 '19 at 07:46
  • I have edited the question to provide more clarity. Consider C3 as a unique hash value; a C1/C2 pair should have only one, but somehow I have now got another C3. So I need to remove rows containing the older C3 values. – arjunkhera Dec 03 '19 at 10:58

4 Answers

3

We can try using exists logic here:

DELETE
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM (SELECT * FROM yourTable) t2
              WHERE t2.C1 = t1.C1 AND t2.C2 = t1.C2 AND t2.C3 > t1.C3);

In plain English, this query deletes any record for which we can find another record having the same C1 and C2 values but a greater C3 value. If such a record exists, the current record being examined is not the latest and should be deleted.
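As a quick sanity check (my harness, not part of the answer), the query's logic can be run against the sample data in SQLite, which, unlike MySQL, allows the deleted table to appear directly in the subquery. Note the implicit assumption here: a newer C3 must always compare greater than the one it replaced, which holds for the sample data (2 > 1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (id INTEGER PRIMARY KEY, "
             "C1 TEXT, C2 TEXT, C3 INTEGER, Timestamp TEXT)")
conn.executemany(
    "INSERT INTO strings VALUES (?, ?, ?, ?, ?)",
    [(1, 'a', 'b', 1, 't1'), (2, 'a', 'b', 1, 't2'),
     (3, 'a', 'b', 2, 't3'), (4, 'a', 'b', 2, 't4'), (5, 'a', 'b', 2, 't5'),
     (6, 'b', 'd', 3, 't6'), (7, 'b', 'd', 3, 't7')])

# Delete every row that has a sibling with the same C1/C2 but a greater C3.
conn.execute("""
    DELETE FROM strings
    WHERE EXISTS (SELECT 1 FROM strings t2
                  WHERE t2.C1 = strings.C1 AND t2.C2 = strings.C2
                    AND t2.C3 > strings.C3)
""")
remaining = [r[0] for r in conn.execute("SELECT id FROM strings ORDER BY id")]
print(remaining)  # [3, 4, 5, 6, 7]
```

Rows 1 and 2 (the old C3 = 1) are gone, matching the question's expected output.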

Tim Biegeleisen
1

C3 contains a distinct value for a given combination of C1 and C2.

Then there should be two tables: one that assigns a C3 to each C1/C2 pair, and one that contains either C1+C2 or C3, plus the timestamp.

Anyway, you want to delete all rows for which a newer timestamp for the same C1/C2 pair with a different C3 exists:

delete from mytable
where exists
(
  select null
  from (select * from mytable) newer
  where newer.c1 = mytable.c1
    and newer.c2 = mytable.c2
    and newer.c3 <> mytable.c3
    and newer.timestamp > mytable.timestamp
);

Writing from (select * from mytable) newer instead of a mere from mytable newer is necessary in MySQL because of a restriction that you cannot read directly from the same table you are updating or deleting from; selecting from a derived table works around this.
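A quick check of this query against the sample data (my test harness, not part of the answer; SQLite does not need the derived-table workaround, but accepts it unchanged):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (id INTEGER PRIMARY KEY, "
             "C1 TEXT, C2 TEXT, C3 INTEGER, Timestamp TEXT)")
conn.executemany(
    "INSERT INTO strings VALUES (?, ?, ?, ?, ?)",
    [(1, 'a', 'b', 1, 't1'), (2, 'a', 'b', 1, 't2'),
     (3, 'a', 'b', 2, 't3'), (4, 'a', 'b', 2, 't4'), (5, 'a', 'b', 2, 't5'),
     (6, 'b', 'd', 3, 't6'), (7, 'b', 'd', 3, 't7')])

# Delete any row for which a newer row with the same C1/C2 but a
# different C3 exists; no ordering assumption on C3 values is needed.
conn.execute("""
    DELETE FROM strings
    WHERE EXISTS (
        SELECT 1
        FROM (SELECT * FROM strings) newer
        WHERE newer.C1 = strings.C1
          AND newer.C2 = strings.C2
          AND newer.C3 <> strings.C3
          AND newer.Timestamp > strings.Timestamp
    )
""")
remaining = [r[0] for r in conn.execute("SELECT id FROM strings ORDER BY id")]
print(remaining)  # [3, 4, 5, 6, 7]
```

Unlike the greater-C3 approach, this works even when a new C3 is not numerically larger than the one it replaced.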

Thorsten Kettner
  • The timestamp appears to not be relevant, q.v. the comments under my answer. – Tim Biegeleisen Dec 03 '19 at 11:13
  • @Tim Biegeleisen: Your answer before the last edit that arjunkhera's comment refers to looked almost like mine, but you had `t2.C3 = t1.C3`. Thus you didn't delete old different C3s as requested, but kept all latest rows per C1/C2/C3 instead. I do think the timestamp is important and we shouldn't rely on a new C3 being greater than a previous one. I am convinced my query is correct. – Thorsten Kettner Dec 03 '19 at 11:21
1

In MySQL, use JOIN and GROUP BY:

delete t
    from t join
         (select c1, c2, c3, max(timestamp) as max_timestamp
          from t
          group by c1, c2, c3
         ) tt
         on tt.c1 = t.c1 and
            tt.c2 = t.c2 and
            tt.c3 = t.c3 and
            tt.max_timestamp > t.timestamp;
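SQLite (used here only to sanity-check the logic; harness and rewrite are mine) has no multi-table DELETE, so the sketch below uses an equivalent correlated EXISTS delete. It keeps only the newest row per C1/C2/C3 group, which on the sample data leaves rows 2, 5 and 7 rather than the 3, 4, 5 the question asks for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (id INTEGER PRIMARY KEY, "
             "C1 TEXT, C2 TEXT, C3 INTEGER, Timestamp TEXT)")
conn.executemany(
    "INSERT INTO strings VALUES (?, ?, ?, ?, ?)",
    [(1, 'a', 'b', 1, 't1'), (2, 'a', 'b', 1, 't2'),
     (3, 'a', 'b', 2, 't3'), (4, 'a', 'b', 2, 't4'), (5, 'a', 'b', 2, 't5'),
     (6, 'b', 'd', 3, 't6'), (7, 'b', 'd', 3, 't7')])

# Equivalent of the multi-table DELETE above: drop every row that has a
# newer sibling within the same C1/C2/C3 group.
conn.execute("""
    DELETE FROM strings
    WHERE EXISTS (SELECT 1 FROM (SELECT * FROM strings) tt
                  WHERE tt.C1 = strings.C1 AND tt.C2 = strings.C2
                    AND tt.C3 = strings.C3
                    AND tt.Timestamp > strings.Timestamp)
""")
remaining = [r[0] for r in conn.execute("SELECT id FROM strings ORDER BY id")]
print(remaining)  # [2, 5, 7]
```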

Gordon Linoff
0

The following is the solution I came up with for my problem, and it works fine on a test data set. Gordon's answer was close, but resulted in the same logic as Tim's. I changed it a bit to fetch only the most up-to-date C3 values and then delete the rows whose C3 is not present in that set.

I had to use "from" twice due to a MySQL restriction that prevents updating the target table through the FROM clause. Here is an answer that explains the problem better.

delete from strings where strings.C3 not in 
( select C3 from 
    ( select C3 from strings join 
        ( select C1, C2, max(timestamp) as max_timestamp from strings group by C1, C2 ) t2
        on strings.C1 = t2.C1
        and strings.C2 = t2.C2
        and strings.Timestamp = max_timestamp 
    ) as t3
);
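Run against the sample data in SQLite (my harness; SQLite tolerates reading the target table directly, but the query is kept exactly as above), it removes rows 1 and 2 as intended:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (id INTEGER PRIMARY KEY, "
             "C1 TEXT, C2 TEXT, C3 INTEGER, Timestamp TEXT)")
conn.executemany(
    "INSERT INTO strings VALUES (?, ?, ?, ?, ?)",
    [(1, 'a', 'b', 1, 't1'), (2, 'a', 'b', 1, 't2'),
     (3, 'a', 'b', 2, 't3'), (4, 'a', 'b', 2, 't4'), (5, 'a', 'b', 2, 't5'),
     (6, 'b', 'd', 3, 't6'), (7, 'b', 'd', 3, 't7')])

# Keep only rows whose C3 belongs to the newest timestamp per C1/C2 pair;
# relies on C3 being unique across the whole table, as stated in the comments.
conn.execute("""
    DELETE FROM strings WHERE strings.C3 NOT IN
    ( SELECT C3 FROM
        ( SELECT C3 FROM strings JOIN
            ( SELECT C1, C2, max(Timestamp) AS max_timestamp
              FROM strings GROUP BY C1, C2 ) t2
            ON strings.C1 = t2.C1
            AND strings.C2 = t2.C2
            AND strings.Timestamp = max_timestamp
        ) AS t3
    )
""")
remaining = [r[0] for r in conn.execute("SELECT id FROM strings ORDER BY id")]
print(remaining)  # [3, 4, 5, 6, 7]
```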
arjunkhera
  • This works provided the same C3 cannot occur for different C1/C2 pairs. I find my answer more readable, but you are right, that MySQL has a problem with using the same table in an update or delete, which is usually circumvented with replacing `from mytable` by `from (select * from mytable)`. I'll update my answer accordingly. – Thorsten Kettner Dec 04 '19 at 09:41
  • And I still think that the main problem you are facing is the flawed data model that made the inconsistencies even possible. – Thorsten Kettner Dec 04 '19 at 09:45
  • The value of C3 is always distinct. So for any given C1/C2 pair, it will always be unique. I have mentioned that in my question. – arjunkhera Dec 04 '19 at 09:45
  • Well, I inherited this DB and need to get this solved asap. But thanks for the suggestion, will keep in mind when moving this to a new schema – arjunkhera Dec 04 '19 at 09:46
  • I know that C3 is considered unique for a C1/C2 pair (which is essentially the problem, as you must now delete inconsistent additional C3s). But what about *other* C1/C2? Can they have the same C3? If so, your delete statement is not sufficient. – Thorsten Kettner Dec 04 '19 at 09:48
  • The value of C3 here is always unique across the table. I agree, if that was not the case then this would not work. – arjunkhera Dec 04 '19 at 09:51