
Hope this makes sense...

We found a bug in our code which was duplicating records with a +1 second offset, so we were slowly seeing the following data build up:

DATE
-----------------------------
07-DEC-15 04.12.53.000000000
07-DEC-15 04.12.54.000000000
07-DEC-15 04.12.55.000000000
07-DEC-15 04.12.56.000000000
07-DEC-15 04.12.57.000000000
07-DEC-15 04.12.58.000000000
07-DEC-15 04.12.59.000000000

07-DEC-15 10.12.54.000000000
07-DEC-15 10.12.55.000000000
07-DEC-15 10.12.56.000000000
07-DEC-15 10.12.57.000000000
07-DEC-15 10.12.58.000000000
07-DEC-15 10.12.59.000000000

07-DEC-15 16.12.55.000000000
07-DEC-15 16.12.56.000000000
07-DEC-15 16.12.57.000000000
07-DEC-15 16.12.58.000000000
07-DEC-15 16.12.59.000000000

07-DEC-15 22.12.56.000000000
07-DEC-15 22.12.57.000000000
07-DEC-15 22.12.58.000000000
07-DEC-15 22.12.59.000000000

We have fixed the bug, but now we need to remove all but one record for each set of 'duplicates'. It has been decided to keep the most recent value (in this case the DATE ending in XX:XX:59 seconds), so we end up with this:

DATE
-----------------------------
07-DEC-15 04.12.59.000000000

07-DEC-15 10.12.59.000000000

07-DEC-15 16.12.59.000000000

07-DEC-15 22.12.59.000000000

Note: This data comes in every 6 hours, so we should see row 1, row 2, row 3, row 4

but because of the bug we were getting

row1
row1,row1,row2
row1,row1,row1,row2,row2,row3
row1,row1,row1,row1,row2,row2,row2,row3,row3
etc

with +1 second for each new row. Our display doesn't show seconds (we don't know why), so these rows appear to be duplicated data, even though technically they are not because of the 1-second difference.
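To illustrate (using the table and column names assumed in the answer below, your_table with a TIMESTAMP column ts), formatting the value without its seconds is what makes these rows look like duplicates:

-- illustrative only: with the seconds dropped from the format mask,
-- the +1 second rows all display identically
select to_char(ts, 'DD-MON-YY HH24.MI') as displayed
from your_table
order by ts;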

KS1
  • Do you have groups of records that rolled over onto the next minute, rather than always ending at :59, and if so how should those be handled? Are the seconds always contiguous for a group, and there's a gap of at least one second between groups? – Alex Poole Apr 07 '16 at 15:22
  • The TIME would just roll over to 04:13:00, 04:13:01, 04:13:02, etc., which we have accepted as OK. We would then have 04:13:02 as a 5th row in the final results. – KS1 Apr 07 '16 at 15:26
  • As a separate entry, even if it was contiguous with the first seven records you listed? Do you actually want both records or think you don't have a choice? – Alex Poole Apr 07 '16 at 15:28
  • Ideally we'd get it down to 4 entries a day, so I guess we could group by hour rather than minute (there's no chance we've gone round the hour, so that shouldn't cause a problem). – KS1 Apr 07 '16 at 15:31

1 Answer


If you want the last row from any contiguous sequence of values even if they span a minute boundary you can delete any which have a value one second later. The last entry in each group will not have a contiguous next row (otherwise it wouldn't be the end of the group) so that is retained.

With your sample data plus the three extra rows from your comment that cross the 04:13:00 boundary, so 25 rows in all:
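A sketch of one way to build that test set, assuming the table is called your_table with a single TIMESTAMP column ts, as in the statements below:

create table your_table (ts timestamp);

-- the original seven rows plus the three comment rows that cross 04:13:00
insert into your_table
  select timestamp '2015-12-07 04:12:53' + (level - 1) * interval '1' second
  from dual connect by level <= 10;

-- 10:12:54 to 10:12:59
insert into your_table
  select timestamp '2015-12-07 10:12:54' + (level - 1) * interval '1' second
  from dual connect by level <= 6;

-- 16:12:55 to 16:12:59
insert into your_table
  select timestamp '2015-12-07 16:12:55' + (level - 1) * interval '1' second
  from dual connect by level <= 5;

-- 22:12:56 to 22:12:59
insert into your_table
  select timestamp '2015-12-07 22:12:56' + (level - 1) * interval '1' second
  from dual connect by level <= 4;

With that data in place: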

delete from your_table t1 where exists (
  select null from your_table t2
  where t2.ts = t1.ts + interval '1' second
);

21 rows deleted.

select * from your_table;

TS                         
----------------------------
07-DEC-15 04.13.02.000000000
07-DEC-15 10.12.59.000000000
07-DEC-15 16.12.59.000000000
07-DEC-15 22.12.59.000000000

If you want records that span minute boundaries to be treated as separate groups, even if they are contiguous, you can include a minute-level filter:

delete from your_table t1 where exists (
  select null from your_table t2
  where t2.ts = t1.ts + interval '1' second
  and trunc(t2.ts, 'MI') = trunc(t1.ts, 'MI')
);

20 rows deleted.

select * from your_table;

TS                         
----------------------------
07-DEC-15 04.12.59.000000000
07-DEC-15 04.13.02.000000000
07-DEC-15 10.12.59.000000000
07-DEC-15 16.12.59.000000000
07-DEC-15 22.12.59.000000000
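
The question comments also mention grouping by the hour rather than the minute; a sketch of that variation (not run against the sample data above) would just swap the truncation level:

-- hour-level grouping, per the question comments; sketch only
delete from your_table t1 where exists (
  select null from your_table t2
  where t2.ts = t1.ts + interval '1' second
  and trunc(t2.ts, 'HH24') = trunc(t1.ts, 'HH24')
);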

If you have other relevant columns, like an ID, you can of course include those in the join conditions as well.
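
For example, with a hypothetical id column (not part of the original example) identifying the feed, the correlated check might look like:

-- sketch only: "id" is a hypothetical extra column
delete from your_table t1 where exists (
  select null from your_table t2
  where t2.id = t1.id
  and t2.ts = t1.ts + interval '1' second
);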

And if you want the start of each group instead of the end, just change + interval '1' second to - interval '1' second.
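
As a quick sketch of that change:

-- keep the first row of each contiguous group instead of the last
delete from your_table t1 where exists (
  select null from your_table t2
  where t2.ts = t1.ts - interval '1' second
);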

Alex Poole
  • Just tested this on some data and looks like it's doing just as we would need! Thanks for that, marked up as appropriate. – KS1 Apr 07 '16 at 15:40
  • Great answer. A straight-forward solution with a very good explanation. – Thorsten Kettner Apr 07 '16 at 15:44
  • @TGray - that assumes the highest timestamp also has the highest rowid, which isn't necessarily true. That pattern is OK for getting rid of complete duplicates, but not where the order of near-duplicates matters, and you care which one you keep. – Alex Poole Apr 07 '16 at 17:20
  • @Alex Poole - you are probably right, particularly where async entries are done. – T Gray Apr 07 '16 at 17:23
  • Doesn't even need to be async, it can just reuse space in an old block. – Alex Poole Apr 07 '16 at 17:25
  • Can it work for min(timestamp) as well, if I cannot add rowid in BQ? – Elad Benda Sep 22 '21 at 00:32