2

Seems like a simple question but I'm having trouble accomplishing it. What I want to do is return all names that have duplicate ids. The view looks as such:

id |  name  | other_col
---+--------+----------
 1 | James  |    x
 2 | John   |    x
 2 | David  |    x
 3 | Emily  |    x
 4 | Cameron|    x
 4 | Thomas |    x

And so in this case, I'd just want the result:

name
-------
John
David
Cameron
Thomas

The following query works but it seems like an overkill to have two separate selects:

select name 
from view where id = ANY(select id from view 
                         WHERE other_col='x' 
                         group by id 
                         having count(id) > 1) 
      and other_col='x';

I believe it should be possible to do something under the lines of:

select name from view WHERE other_col='x' group by id, name having count(id) > 1;

But this returns nothing at all! What is the 'proper' query?

Do I just have to it like my first working suggestion or is there a better way?

Luc M
  • 16,630
  • 26
  • 74
  • 89
nico
  • 2,022
  • 4
  • 23
  • 37

5 Answers5

2
SELECT name FROM Table
WHERE id IN (SELECT id, COUNT(*) FROM Table GROUP BY id HAVING COUNT(*)>1) Temp
Nick
  • 7,103
  • 2
  • 21
  • 43
Muthaiah PL
  • 1,048
  • 3
  • 15
  • 26
  • This is pretty much identical to what I suggested and I'm wondering if there's a better way to accomplish this... – nico Jul 15 '16 at 20:43
  • May i know in what terms you are looking for a better way? – Muthaiah PL Jul 15 '16 at 20:46
  • Specifically, I just want to be able to know if this can be done in a single query because I'd rather not have to repeat two separate selects (if possible). The actual where conditions are quite long and I'd like to simplify it. – nico Jul 15 '16 at 20:48
2

You state you want to avoid two "queries", which isn't really possible. There are plenty of solutions available, but I would use a CTE like so:

WITH cte AS
(
SELECT
    id,
    name,
    other_col,
    COUNT(name) OVER(PARTITION BY id) AS id_count
FROM
    table
)

SELECT name FROM cte WHERE id_count > 1;

You can reuse the CTE, so you don't have to duplicate logic and I personally find it easier to read and understand what it is doing.

Nick
  • 7,103
  • 2
  • 21
  • 43
  • Absolutely perfect! Thanks Nicarus, this is exactly what I was looking for. – nico Jul 15 '16 at 21:00
  • Good one. BTW, I have studied Temp Tables and derived tables are good in performance when comparing with CTE, am i right? – Muthaiah PL Jul 15 '16 at 22:15
  • 2
    @MuthaiahPL - In my experience it depends on what you are doing, but temp tables do allow for such things as adding indexes - CTEs do not. In this case the question was not about performance, but simplicity of writing the query, which I believe the CTE provides. – Nick Jul 15 '16 at 22:32
0

Use EXIST operator

SELECT * FROM table t1
WHERE EXISTS(
  SELECT null FROM table t2
  WHERE t1.id = t2.id 
    AND t1.name <> t2.name
)
krokodilko
  • 35,300
  • 7
  • 55
  • 79
0

Use a join:

select distinct name 
from view v1
join view v2 on v1.id = v2.id
  and v1.name != v2.name

The use of distinct is there in case there are more than 2 rows sharing the same id. If that's not possible, you can omit distinct.


A note: Naming a column id when it's not unique will likely cause confusion, because it's the industry standard for the unique identifier column. If there isn't a unique column at all, it will cause coding difficulties.

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • This isn't a bad suggestion but it happens that my view is large so I'd have to go `on v1.id = v2.id AND v1.other_col='x' AND v2.other_col='x'` and it takes longer than the accepted answer. About the `id`, I will make sure to name my columns more appropriately in future questions, thanks! – nico Jul 15 '16 at 21:04
0

Do not use a CTE. That's typically more expensive because Postgres has to materialize the intermediary result.

An EXISTS semi-join is typically fastest for this. Just make sure to repeat predicates (or match the values):

SELECT name 
FROM   view v
WHERE  other_col = 'x'
AND    EXISTS (
   SELECT 1 FROM view 
   WHERE  other_col = 'x' -- or: other_col = v.other_col
   AND    id <> v.id      -- exclude join to self
   );

That's a single query, even if you see the keyword SELECT twice here. An EXISTS expression does not produce a derived table, it will be resolved to simple index look-ups.

Speaking of which: a multicolumn index on (other_col, id) should help. Depending on data distribution and access patterns, appending the payload column name to enable index-only scans might help: (other_col, id, name). Or even a partial index, if other_col = 'x' is a constant predicate:

CREATE INDEX ON view (id) WHERE other_col = 'x';

The upcoming Postgres 9.6 would even allow an index-only scan on the partial index:

CREATE INDEX ON view (id, name) WHERE other_col = 'x';

You will love this improvement (quoting the /devel manual):

Allow using an index-only scan with a partial index when the index's predicate involves column(s) not stored in the index (Tomas Vondra, Kyotaro Horiguchi)

An index-only scan is now allowed if the query mentions such columns only in WHERE clauses that match the index predicate

Verify performance with EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Run a couple of times to rule out caching effects.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228