0

I have a simple database table

create table demo (
  id     integer PRIMARY KEY,
  fv     integer,
  sv     text,
  rel_id integer,
  FOREIGN KEY (rel_id)
  REFERENCES demo(id));

and i want to delete all duplicate rows grouped by fv and sv. Which is already a fairly popular question with great answers. But I need a twist on that scenario. As in cases where rel_id is NULL I want to keep that row. In any other case anything goes.

So by using the following values

insert into demo (id,fv,sv,rel_id)
VALUES (1,1,'somestring',NULL),
       (2,2,'somemorestring',1),
       (3,1,'anotherstring',NULL),
       (4,2,'somemorestring',3),
       (5,1,'somestring',3)

Either

id | fv |     sv           | rel_id
---+----+------------------+-------
 1 | 1  | 'somestring'     | NULL
 2 | 2  | 'somemorestring' | 1
 3 | 1  | 'anotherstring'  | NULL

or

id | fv |     sv           | rel_id
---+----+------------------+-------
 1 | 1  | 'somestring'     | NULL
 3 | 1  | 'anotherstring'  | NULL
 4 | 2  | 'somemorestring' | 3

would be valid results. Where as

id | fv |     sv           | rel_id
---+----+------------------+-------
 3 | 1  | 'anotherstring'  | NULL
 4 | 2  | 'somemorestring' | 3
 5 | 1  | 'somestring'     | 3

would not be. As the first entry had NULL as rel_id which takes presidency above NOT NULL.

I currently have this (which is an answer on the basic duplicate question) as a query to remove duplicates but I am not sure how to continue to modify the query to fit my needs.

DELETE FROM demo
WHERE  id NOT IN (SELECT min(id) as id
                  FROM demo
                  GROUP BY fv,sv)

As as soon as the NOT NULL entry is inserted into the database before the NULL entry the NOT NULL one will not be deleted. It is guaranteed that rel_id will always point to an entry where rel_id is NULL therefore there is no danger of deleting a referenced entry. Further it is guaranteed that there will be no two rows in the same group with rel_id IS NULL. Therefore a row with rel_id IS NULL is unique for the whole table.

Or as a basic algorithm:

  1. Go over all rows and group them by fv and sv
  2. Look into each group for a row where rel_id IS NULL. If there is keep that row (and delete the rest). Else pick one row of your choice and delete the rest.

sqlfiddle

Community
  • 1
  • 1
Sim
  • 4,199
  • 4
  • 39
  • 77
  • I see you've done serious work on trying to solve, but I can't clearly get the task. Are there conditions you would like to delete records with `rel_id=NULL`? If not, just add `WHERE rel_id IS NOT NULL` to the inner `select` – Kirill Slatin Mar 17 '15 at 00:54
  • @KirillSlatin that would kill any row where there is a `NULL` entry in `rel_id`. I will edit in further explanation. – Sim Mar 17 '15 at 01:00

1 Answers1

0

I seem to have worked it out

DELETE FROM demo
    WHERE  id NOT IN (SELECT min(id) as id
                      FROM demo AS out_buff
                      WHERE rel_id IS NULL OR
                            NOT EXISTS (SELECT id FROM demo AS in_buff
                                        WHERE rel_id IS NULL AND
                                              in_buff.fv = out_buff.fv AND
                                              in_buff.sv = out_buff.sv)
                      GROUP BY fv,sv);

by selecting in the inner SELECT either only the row with the rel_id with the value NULL or all rows matching on the GROUP BY arguments, by using the anti-condition to the existence of a row with rel_id IS NULL. But my query looks really ineffective. As a naive assumption would put the running time at at least O(n^2).

Sim
  • 4,199
  • 4
  • 39
  • 77
  • you have 2 nested queries. It might have performance issues on large data sets. If I understood your explanation correctly, you need to [include NULL in your aggregate function MIN](http://stackoverflow.com/questions/21286215/how-can-i-include-null-values-in-a-min-or-max) – Kirill Slatin Mar 17 '15 at 03:15