How to delete duplicate data which has less than 10 seconds difference in postgreSQL

Question

Currently, the table is like below.

datetime	number	content
2018-01-01 02:49:04	1	spring
2018-01-01 02:49:10	1	spring
2018-01-01 02:49:24	1	spring
2018-01-01 02:49:29	1	summer
2018-01-01 02:49:44	1	spring
2018-01-01 02:49:49	1	spring
2018-01-01 02:49:50	1	winter
2018-01-01 02:49:51	1	spring

If 'number' and 'content' columns have the same values, the time difference will have to be more than 10 seconds. ( * means it should be removed) So the table should be like below.

datetime	number	content
2018-01-01 02:49:04	1	spring
2018-01-01 02:49:24	1	spring
2018-01-01 02:49:29	1	summer
2018-01-01 02:49:44	1	spring
2018-01-01 02:49:50	1	winter

I refered to Delete Duplicate Data on PostgreSQL, but it's quite different from my case. I think the code would be like

DELETE FROM table a USING table b 
WHERE
(DATETIME CALCULATION CODE)
AND a.number = b.number 
AND a.content = b.content

Thank you for helping in advance.

Do you want to apply this rule recursively? So if you delete a row and that makes the gap between the next row and the previous "survivor" row bigger, should it then also delete those rows? — , Feb 17 '21 at 11:10

wildplasser · Answer 1 · 2021-02-18T10:42:29.020

You can use exists():

DELETE FROM foo del
WHERE EXISTS (
        SELECT * FROM foo x
        WHERE x.znumber = del.znumber AND x.content = del.content
        AND x.dt < del.dt
        AND x.dt >= del.dt - '10 sec'::interval
        )
        ;

select * from foo;

Result:

DELETE 16
 id |         dt          | znumber | content 
----+---------------------+---------+---------
  1 | 2018-01-01 02:49:04 |       1 | spring
 16 | 2018-01-01 02:49:29 |       1 | summer
 17 | 2018-01-01 02:49:44 |       1 | spring
 19 | 2018-01-01 02:49:50 |       1 | winter
(4 rows)

Extra: if you want to eliminate the middle men, but keep the records on the sides, you could use: [this will always keep the first and last records, even if they are too close]

WITH laglead AS (
        SELECT id
        -- , dt AS this
        -- , znumber, content
        , lag(dt) OVER www AS prev
        , lead(dt) OVER www AS next
        FROM foo x
        WINDOW www AS (PARTITION BY znumber,content ORDER BY dt)
        )
DELETE FROM foo del
WHERE EXISTS (
        SELECT *
        FROM laglead x
        WHERE x.id = del.id
        AND x.next < x.prev + '10 sec'::interval
        )
        ;

score 1 · Answer 2 · answered Feb 17 '21 at 11:09

You can first calculate all rows that have a gap smaller than 10 seconds and use the ctid values of those rows to delete them. If you have a real unique key (e.g. an identity) column it would be better to use that, as the ctid comparison is quite slow.

with flagged as (
  select ctid as rid, 
         row_number() over w as rn,
         datetime - lag(datetime) over w < interval '10 seconds' as small_gap
  from the_table
  window w as (partition by number, content order by datetime)
)
delete from the_table
where ctid in (select rid
               from flagged
               where small_gap)

Online example

score 0 · Answer 3 · answered Feb 17 '21 at 09:47

0

Assuming that recodrs in your table have unique id-s, first find the id-s of those that are to be deleted using lag window function and then delete them by id. The reason to do it in two steps is because you can not use window functions in a where clause.

with t as
(
 select id, 
 "datetime" - lag("datetime", 1) over (partition by "number", "content" order by "datetime") < interval '10 seconds' as to_delete
 from _table
)
delete from _table where id in (select id from t where to_delete);

answered Feb 17 '21 at 09:47

Stefanov.sm

11,215
2
21
21

I guess this would not work as expected. If you had such series: 0:00:00, 00:00:09, 00:00:11 you would delete the second and third record, althought you only need to remove the second (than the difference between first and third is big enough). – S-Man Feb 17 '21 at 10:30

score 0 · Answer 4 · answered Feb 17 '21 at 10:41

I don't think you can do this with window functions in the general case.

Suppose the table contains a sequence of timestamps, one every second. Your requirement of "time difference will have to be more than 10 seconds" means keep the first timestamp, then skip next 9 records, but the next record will have 10 seconds difference with the first so it should be kept. This means this cannot be solved by only comparing current and previous row, rather it must compare the current row and the last row that will be kept, delete rows if the difference is less than 10 seconds, and keep the row if it is not.

So... plpgsql.

DROP TABLE foo;
CREATE TABLE foo( id SERIAL PRIMARY KEY,
dt TIMESTAMP WITHOUT TIME ZONE,
number INT NOT NULL, content TEXT NOT NULL);
\copy foo(dt,number,content) FROM stdin
2018-01-01 02:49:04 1   spring
2018-01-01 02:49:10 1   spring
2018-01-01 02:49:11 1   spring
2018-01-01 02:49:12 1   spring
2018-01-01 02:49:13 1   spring
2018-01-01 02:49:14 1   spring
2018-01-01 02:49:15 1   spring
2018-01-01 02:49:16 1   spring
2018-01-01 02:49:17 1   spring
2018-01-01 02:49:18 1   spring
2018-01-01 02:49:19 1   spring
2018-01-01 02:49:20 1   spring
2018-01-01 02:49:21 1   spring
2018-01-01 02:49:22 1   spring
2018-01-01 02:49:24 1   spring
2018-01-01 02:49:29 1   summer
2018-01-01 02:49:44 1   spring
2018-01-01 02:49:49 1   spring
2018-01-01 02:49:50 1   winter
2018-01-01 02:49:51 1   spring
\.

CREATE OR REPLACE FUNCTION foo_del( )
RETURNS SETOF INT
AS $$
DECLARE
    row         foo%ROWTYPE;
    last_row    foo%ROWTYPE;
BEGIN
    FOR row IN SELECT * FROM foo ORDER BY number, content, dt
    LOOP
        IF last_row.id IS NULL OR row.number != last_row.number OR row.content != last_row.content 
            OR row.dt >= last_row.dt THEN
            last_row := row;
            last_row.dt := last_row.dt + '10 SECOND'::INTERVAL;
        ELSE
            IF row.dt < last_row.dt THEN
                RETURN NEXT row.id;
            END IF;
        END IF;
    END LOOP;
END;
$$ LANGUAGE PLPGSQL;

SELECT * FROM foo LEFT JOIN (SELECT * FROM foo_del()) d ON (foo.id=d.foo_del) ORDER BY id;

DELETE FROM foo WHERE id IN (SELECT * FROM foo_del());

SELECT * FROM foo ORDER BY id;

How to delete duplicate data which has less than 10 seconds difference in postgreSQL

4 Answers4