
For my smart home setup I created a database, but I made a mistake in the application: it inserts every reading into the database twice. I want to delete all rows that are duplicates. By a duplicate I mean a tuple whose data is identical to the previous row of the same type. I have marked the duplicates in this example with "<<" (please also pay attention to the last 3 rows). I want to keep the first row of each new value, so I want to delete the duplicates that follow it. I hope you can help me solve my problem.

SmartHome=# select * from sensordata order by time desc Limit 21;
 type | data |            time
------+------+----------------------------
    8 | 2459 | 2019-08-09 23:10:39.530087 << 
    8 | 2459 | 2019-08-09 23:10:39.356908
    8 | 2445 | 2019-08-09 23:05:39.933269 <<
    8 | 2445 | 2019-08-09 23:05:39.789173
   10 | 6105 | 2019-08-09 22:50:50.40792  <<
   10 | 6105 | 2019-08-09 22:50:50.096132
    8 | 2459 | 2019-08-09 22:50:41.429681 <<
    8 | 2459 | 2019-08-09 22:50:41.357483
    8 | 2474 | 2019-08-09 22:45:42.13396  <<
    8 | 2474 | 2019-08-09 22:45:41.813046
   10 | 6221 | 2019-08-09 22:40:51.107709 <<
   10 | 6221 | 2019-08-09 22:40:51.076903
   10 | 6105 | 2019-08-09 22:35:51.737255 <<
   10 | 6105 | 2019-08-09 22:35:51.544886
   10 | 6221 | 2019-08-09 22:30:52.493895 <<
   10 | 6221 | 2019-08-09 22:30:51.795203
    8 | 2459 | 2019-08-09 22:30:43.193447 <<
    8 | 2459 | 2019-08-09 22:30:43.045599
   10 | 6105 | 2019-08-09 22:25:52.571793 << Duplicate like them above
   10 | 6105 | 2019-08-09 22:25:52.442844 << Also a Duplicate with much more
   10 | 6105 | 2019-08-09 22:20:51.356846    time between the rows
(21 rows)

SmartHome=# \d sensordata
                   Table "public.sensordata"
 Column |            Type             |       Modifiers
--------+-----------------------------+------------------------
 type   | integer                     | not null
 data   | character varying(20)       | not null
 time   | timestamp without time zone | not null default now()
Indexes:
    "smarthome_idx" UNIQUE, btree (type, "time")
Foreign-key constraints:
    "sensordata_type_fkey" FOREIGN KEY (type) REFERENCES sensortype(id)

If I run

with a as (
  select *, row_number() over (partition by type, data order by time)
  from sensordata
)
select * from a where row_number = 1
order by time desc;

the output is:

   10 | 17316 | 2019-08-09 09:43:46.938507 |          1
   10 | 18276 | 2019-08-09 09:38:47.129788 |          1
   10 | 18176 | 2019-08-09 09:33:47.889064 |          1
   10 | 17107 | 2019-08-08 10:36:11.383106 |          1
   10 | 17921 | 2019-08-08 09:56:15.889191 |          1
   10 | 17533 | 2019-08-03 09:30:11.047639 |          1

That's not what I mean :/ (Sorry, I don't know how to mark the text as a code block in a comment, hence this way.)

Root_DE

3 Answers


There are many possible ways to do this. The fastest is often a correlated subquery but I can never remember the syntax so I normally go for window functions, specifically row_number().

If you run

Select *, row_number() over(partition by type, data order by time) from sensordata

That should give a version of your table where all the rows you want to keep have a number 1 and the duplicates are numbered 2,3,4... Use that same field in a delete query and you'll be sorted.
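As a quick sanity check of that numbering (not part of the answer itself: the table and sample rows below are made up, run through Python's sqlite3, which supports window functions since SQLite 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensordata (type INTEGER, data TEXT, time TEXT);
INSERT INTO sensordata VALUES
  (8, '2445', '2019-08-09 23:05:39'),
  (8, '2445', '2019-08-09 23:05:40'),
  (8, '2445', '2019-08-09 23:05:41');
""")

# Number each (type, data) group by time: the row to keep gets 1,
# exact repeats get 2, 3, ...
rows = conn.execute("""
    SELECT data, time,
           row_number() OVER (PARTITION BY type, data ORDER BY time) AS rn
    FROM sensordata
""").fetchall()
print(rows)
```

Deleting every row with `rn > 1` then leaves only the first occurrence of each value.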

EDIT: I understand now you only want to remove duplicates that occur sequentially within the same type. This can also be achieved using row_number() and a join. This query should give you only the data you want.

WITH s as (SELECT *, row_number() over(partition by type order by time) as rnum from sensordata)
SELECT a.*
FROM s a
JOIN s b
ON a.rnum = b.rnum + 1 AND a.type = b.type
WHERE a.data <> b.data

This might need a slight tweak to avoid missing the very first entry if that's important.
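To see the join in action, here is a small self-contained sketch (sqlite3 as a stand-in for Postgres; the sample rows are invented, and the first-entry caveat above shows up as the lone type-10 row being dropped):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensordata (type INTEGER, data TEXT, time TEXT);
INSERT INTO sensordata VALUES
  (8,  '2445', '2019-08-09 23:05:39'),
  (8,  '2445', '2019-08-09 23:05:40'),  -- sequential duplicate: filtered out
  (8,  '2459', '2019-08-09 23:10:39'),  -- value changed: kept
  (10, '6105', '2019-08-09 22:50:50');  -- first entry of its type: missed
""")

# Join each row to its predecessor within the same type and keep
# only the rows whose data actually changed.
keepers = conn.execute("""
    WITH s AS (
      SELECT *, row_number() OVER (PARTITION BY type ORDER BY time) AS rnum
      FROM sensordata
    )
    SELECT a.type, a.data, a.time
    FROM s a
    JOIN s b ON a.rnum = b.rnum + 1 AND a.type = b.type
    WHERE a.data <> b.data
""").fetchall()
print(keepers)
```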

Simon Notley
  • I merged my comment into the question, because I don't know how to post a code block in a comment :/ – Root_DE Aug 09 '19 at 22:28
  • With this query I still don't get all duplicates, e.g. this one is still there: `10 | 6105 | 2019-08-10 00:35:47.137982 | 32616 ` ` 8 | 2459 | 2019-08-10 00:35:46.906772 | 32615 ` `10 | 6105 | 2019-08-10 00:29:47.087688 | 32614 ` – Root_DE Aug 10 '19 at 11:12
  • I literally just edited my answer in response to feedback on the other answer. Try the new version. – Simon Notley Aug 10 '19 at 11:14

You can identify the duplicates using lag():

select t.*
from (select t.*,
             lag(time) over (partition by type, data order by time) as prev_time
      from sensordata t
     ) t
where prev_time > time - interval '10 minute';

This assumes that rows within 10 minutes of each other count as duplicates; you don't specify a threshold in the question.

Then, you can delete these using a join. It would be better if you had a serial primary key, so you knew that data was unique. But it seems reasonable to assume that type/data/time will be unique.

So:

delete from sensordata t
    using (select s.type, s.data, s.time,
                  lag(s.time) over (partition by s.type, s.data order by s.time) as prev_time
           from sensordata s
          ) tt
    where tt.prev_time > t.time - interval '10 minute' and
          tt.type = t.type and
          tt.data = t.data and
          tt.time = t.time;

Note that if you do have exact duplicates in the table, then you might end up duplicating all the rows for a given combination, including the original one.
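A rough way to try the lag() idea end to end (using Python's sqlite3 rather than Postgres, so the interval is spelled `datetime(time, '-10 minutes')`; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensordata (type INTEGER, data TEXT, time TEXT);
INSERT INTO sensordata VALUES
  (8,  '2459', '2019-08-09 23:10:39'),
  (8,  '2459', '2019-08-09 23:10:40'),  -- same data 1 second later: duplicate
  (10, '6105', '2019-08-09 22:20:51'),
  (10, '6105', '2019-08-09 22:35:51');  -- same data, 15 minutes later: kept
""")

# Flag rows whose previous row with the same (type, data)
# is less than 10 minutes older.
dupes = conn.execute("""
    SELECT type, data, time FROM (
      SELECT t.*,
             lag(time) OVER (PARTITION BY type, data ORDER BY time) AS prev_time
      FROM sensordata t
    )
    WHERE prev_time > datetime(time, '-10 minutes')
""").fetchall()
print(dupes)
```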

Gordon Linoff
  • Good answer as I've learnt a new function, never knew about lag(). However we both made the mistake of assuming the OP wanted to partition by type. It seems that actually the requirement is to remove duplicate consecutive records with any amount of elapsed time but no partitioning. I've updated my answer to achieve this. – Simon Notley Aug 10 '19 at 07:00
  • @SimonN . . . The question explicitly states: "With duplicate I mean a tuples what is identically in the data to the last one from the same type." – Gordon Linoff Aug 10 '19 at 11:01
  • So it does, not sure how I managed to misread that! Now my answer needs editing again. – Simon Notley Aug 10 '19 at 11:02

If you want to select or delete records based on the existence of other records, EXISTS() is your friend.

First:

  • define exactly what you consider a set of duplicates
  • decide which one you want to keep, and which ones to delete

Assume that you want to delete the newer records, but only if the data did not change between the one to keep and the one to delete.

Select the ones to delete:


SELECT *
FROM sensordata d
WHERE EXISTS(                   -- if there exists a record with...
        SELECT * FROM sensordata x
        WHERE x.type = d.type   -- the same type
        AND x.data = d.data     -- the same data
        AND x.time < d.time     -- but: older
        AND NOT EXISTS(                 -- and NO record exists with...
                SELECT * FROM sensordata nx
                WHERE nx.type = d.type  -- the same type
                AND nx.data <> d.data   -- but DIFFERENT data
                AND nx.time < d.time    -- between the one to delete
                AND nx.time > x.time    -- and the older one
                )
        );

If that looks OK, you can use exactly the same condition in a DELETE statement.


DELETE FROM sensordata d
WHERE EXISTS(                   -- if there exists a record with...
        SELECT * FROM sensordata x
        WHERE x.type = d.type   -- the same type
        AND x.data = d.data     -- the same data
        AND x.time < d.time     -- but: older
        AND NOT EXISTS(                 -- and NO record exists with...
                SELECT * FROM sensordata nx
                WHERE nx.type = d.type  -- the same type
                AND nx.data <> d.data   -- but DIFFERENT data
                AND nx.time < d.time    -- between the one to delete
                AND nx.time > x.time    -- and the older one
                )
        );
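A self-contained run of this DELETE (ported to Python's sqlite3 with the question's column names; the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensordata (type INTEGER, data TEXT, time TEXT);
INSERT INTO sensordata VALUES
  (8, '2459', '2019-08-09 22:30:43'),
  (8, '2459', '2019-08-09 22:30:44'),  -- duplicate of the row above: deleted
  (8, '2474', '2019-08-09 22:45:41'),
  (8, '2459', '2019-08-09 22:50:41');  -- same value, but 2474 came in between: kept
""")

# Delete a row when an older row with the same type and data exists
# and no row with different data lies between the two.
conn.execute("""
    DELETE FROM sensordata AS d
    WHERE EXISTS(
            SELECT * FROM sensordata x
            WHERE x.type = d.type
            AND x.data = d.data
            AND x.time < d.time
            AND NOT EXISTS(
                    SELECT * FROM sensordata nx
                    WHERE nx.type = d.type
                    AND nx.data <> d.data
                    AND nx.time < d.time
                    AND nx.time > x.time
            )
    )
""")
print(conn.execute("SELECT data, time FROM sensordata ORDER BY time").fetchall())
```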
wildplasser