Filter duplicate rows in Postgres based on conditions between those rows

Question

Given a table

CREATE TABLE data(
 irs_number VARCHAR (50),
 mop_up INTEGER,
 ou VARCHAR (50)
);

How would I return all matching records that...

have at least one identical value for irs_number in another row AND
at least one mop_up of those with the same irs_number must be set to 1 AND
the ou values are not identical, i.e. only return those not matching to a row with the identical irs_number.

... so that all irs_numbers would be returned (not only one where the conditions are true - see example below).

I tried this but the query cannot finish within reasonable time:

SELECT irs_number, mop_up, ou
FROM data outer_data
WHERE (SELECT count(*)
FROM data inner_data
WHERE inner_data.irs_number = outer_data.irs_number
AND inner_data.mop_up = 1 OR outer_data.mop_up = 1
AND inner_data.ou <> outer_data.ou
);

As well as variations of duplicate counts as described here: How to find duplicate records in PostgreSQL - they will always just return the duplicates but not the proper filter applied.

edit:

Example data:

INSERT INTO data VALUES 
('0001', 1, 'abc'),
('0001', 0, 'abc'),
('0001', 0, 'cde'),
('0001', 0, 'abc'),
('0002', 1, 'abc'),
('0002', 0, 'abc'),
('0003', 0, 'abc'),
('0003', 0, 'xyz')
;

SQLFiddle: http://sqlfiddle.com/#!17/be28f

a query should ideally return:

irs_number  mop_up  ou
-----------------------
0001        1       abc
0001        0       abc
0001        0       cde
0001        0       abc

(order not important) meaning it should return all rows matching having the irs_number with the condition above.

forpas · Answer 1 · 2019-03-07T18:24:17.810

1

I think this join will do:

SELECT * FROM data 
WHERE irs_number in (
  SELECT irs_number
  FROM data d
  WHERE EXISTS (SELECT 1
    FROM data 
    WHERE irs_number = d.irs_number
    AND (mop_up = 1 OR d.mop_up = 1)
    AND ou <> d.ou
  )
)

See the demo

edited Mar 07 '19 at 18:24

answered Mar 07 '19 at 17:20

forpas

160,666
10
38
76

it does return also rows that have just a single `irs_number` @forpas – dh762 Mar 07 '19 at 17:40
This joins rows with the same `irs_number` and different `ou`, meaning different rows with the same `irs_number`. Can you post sample data and expected results so to be clear? – forpas Mar 07 '19 at 17:44
correction to the comment: it does return one row only if there are duplicate `irs_numbers` within the same `ou` (at least one has `mop_up` = true). happy to add sample data – dh762 Mar 07 '19 at 17:48
Why should `0001 0 abc` be returned? There is not a row with the same irs_number and different ou with mop_up = 1 to any of the 2 rows. – forpas Mar 07 '19 at 18:16
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189610/discussion-between-dh762-and-forpas). – dh762 Mar 07 '19 at 18:17
I think this answer is more complicated than necessary for the question. – Gordon Linoff Mar 07 '19 at 19:06
@GordonLinoff then why don't you post an answer less complicated that does solve problem? – forpas Mar 07 '19 at 19:16

Gordon Linoff · Accepted Answer · 2019-03-07T20:54:09.637

1

You should be able to do this with a simple exists clause:

SELECT irs_number, mop_up, ou
FROM data d
WHERE EXISTS (SELECT 1
              FROM data d2
              WHERE d2.irs_number = d.irs_number AND
                    d2.mop_up = 1 AND
                    d2.ou <> d.ou
             );

EDIT:

The above misinterpreted the question. It assumed that a mop_up = 1 needed to be on a different ou. As I read the question, this is ambiguous but doesn't appear to be what you want. So, two exists address this:

SELECT irs_number, mop_up, ou
FROM data d
WHERE EXISTS (SELECT 1
              FROM data d2
              WHERE d2.irs_number = d.irs_number AND
                    d2.mop_up = 1
             ) AND
     EXISTS (SELECT 1
              FROM data d2
              WHERE d2.irs_number = d.irs_number AND
                    d2.ou <> d.ou
             );

Here is a db<>fiddle.

Both these solutions will be able to take advantage of an index on (irs_number, mop_up, ou).

edited Mar 07 '19 at 20:54

answered Mar 07 '19 at 19:06

Gordon Linoff

1,242,037
58
646
786

This has nothing to do with the requirements of the question. – forpas Mar 07 '19 at 19:14
this would return only those with `mop_up` set to `1` but not all within where at least on `mop_up` was set to `1` – dh762 Mar 07 '19 at 20:42
@dh762 . . . Not at all. This would return all `irs_numbers` that have a corresponding row with `mop_up = 1` -- which is what you are asking for. I have no idea why you are confusing the `where` clause in the correlated subquery with what the outer query returns. – Gordon Linoff Mar 07 '19 at 20:44
correct, but this would only serve as a subquery because the final query should return all records with that subqueried `irs_number` - see example - your query does only return 1 row with `0001` instead of all 4 with `0001`. It might not be obvious from the pre-edit question (will update) – dh762 Mar 07 '19 at 20:46
@dh762 . . . I see. I interpreted the second and third bullet points differently from what you intended. I've adjusted the answer. – Gordon Linoff Mar 07 '19 at 20:55
this solution is 2x as fast as https://stackoverflow.com/a/55049567/1344855 on the same database – dh762 Mar 07 '19 at 21:06

Filter duplicate rows in Postgres based on conditions between those rows

2 Answers2