PostgreSQL: How to select on non-aggregating column?

Question

Seems like a simple question but I'm having trouble accomplishing it. What I want to do is return all names that have duplicate ids. The view looks as such:

id |  name  | other_col
---+--------+----------
 1 | James  |    x
 2 | John   |    x
 2 | David  |    x
 3 | Emily  |    x
 4 | Cameron|    x
 4 | Thomas |    x

And so in this case, I'd just want the result:

name
-------
John
David
Cameron
Thomas

The following query works but it seems like an overkill to have two separate selects:

select name 
from view where id = ANY(select id from view 
                         WHERE other_col='x' 
                         group by id 
                         having count(id) > 1) 
      and other_col='x';

I believe it should be possible to do something under the lines of:

select name from view WHERE other_col='x' group by id, name having count(id) > 1;

But this returns nothing at all! What is the 'proper' query?

Do I just have to it like my first working suggestion or is there a better way?

Are you simply trying to avoid using more than one `SELECT` statement? — Nick, Jul 15 '16 at 20:47
@Nicarus Yes. I want to remove the redundant WHERE other_col='x' (the real conditions are quite long) so removing 1 `SELECT` will make it clearer. — nico, Jul 15 '16 at 20:51

score 2 · Answer 1 · edited Jul 15 '16 at 20:47

2

SELECT name FROM Table
WHERE id IN (SELECT id, COUNT(*) FROM Table GROUP BY id HAVING COUNT(*)>1) Temp

edited Jul 15 '16 at 20:47

Nick

7,103
2
21
43

answered Jul 15 '16 at 20:40

Muthaiah PL

1,048
3
15
26

This is pretty much identical to what I suggested and I'm wondering if there's a better way to accomplish this... – nico Jul 15 '16 at 20:43
May i know in what terms you are looking for a better way? – Muthaiah PL Jul 15 '16 at 20:46
Specifically, I just want to be able to know if this can be done in a single query because I'd rather not have to repeat two separate selects (if possible). The actual where conditions are quite long and I'd like to simplify it. – nico Jul 15 '16 at 20:48

score 2 · Accepted Answer · answered Jul 15 '16 at 20:55

2

You state you want to avoid two "queries", which isn't really possible. There are plenty of solutions available, but I would use a CTE like so:

WITH cte AS
(
SELECT
    id,
    name,
    other_col,
    COUNT(name) OVER(PARTITION BY id) AS id_count
FROM
    table
)

SELECT name FROM cte WHERE id_count > 1;

You can reuse the CTE, so you don't have to duplicate logic and I personally find it easier to read and understand what it is doing.

answered Jul 15 '16 at 20:55

Nick

7,103
2
21
43

Absolutely perfect! Thanks Nicarus, this is exactly what I was looking for. – nico Jul 15 '16 at 21:00
Good one. BTW, I have studied Temp Tables and derived tables are good in performance when comparing with CTE, am i right? – Muthaiah PL Jul 15 '16 at 22:15
2

@MuthaiahPL - In my experience it depends on what you are doing, but temp tables do allow for such things as adding indexes - CTEs do not. In this case the question was not about performance, but simplicity of writing the query, which I believe the CTE provides. – Nick Jul 15 '16 at 22:32

score 0 · Answer 3 · answered Jul 15 '16 at 20:43

0

Use EXIST operator

SELECT * FROM table t1
WHERE EXISTS(
  SELECT null FROM table t2
  WHERE t1.id = t2.id 
    AND t1.name <> t2.name
)

answered Jul 15 '16 at 20:43

krokodilko

35,300
7
55
79

Bohemian · Answer 4 · 2016-07-15T21:03:59.417

0

Use a join:

select distinct name 
from view v1
join view v2 on v1.id = v2.id
  and v1.name != v2.name

The use of distinct is there in case there are more than 2 rows sharing the same id. If that's not possible, you can omit distinct.

A note: Naming a column id when it's not unique will likely cause confusion, because it's the industry standard for the unique identifier column. If there isn't a unique column at all, it will cause coding difficulties.

edited Jul 15 '16 at 21:03

answered Jul 15 '16 at 20:51

Bohemian

412,405
93
575
722

This isn't a bad suggestion but it happens that my view is large so I'd have to go `on v1.id = v2.id AND v1.other_col='x' AND v2.other_col='x'` and it takes longer than the accepted answer. About the `id`, I will make sure to name my columns more appropriately in future questions, thanks! – nico Jul 15 '16 at 21:04

score 0 · Answer 5 · edited May 23 '17 at 12:14

Do not use a CTE. That's typically more expensive because Postgres has to materialize the intermediary result.

An EXISTS semi-join is typically fastest for this. Just make sure to repeat predicates (or match the values):

SELECT name 
FROM   view v
WHERE  other_col = 'x'
AND    EXISTS (
   SELECT 1 FROM view 
   WHERE  other_col = 'x' -- or: other_col = v.other_col
   AND    id <> v.id      -- exclude join to self
   );

That's a single query, even if you see the keyword SELECT twice here. An EXISTS expression does not produce a derived table, it will be resolved to simple index look-ups.

Speaking of which: a multicolumn index on (other_col, id) should help. Depending on data distribution and access patterns, appending the payload column name to enable index-only scans might help: (other_col, id, name). Or even a partial index, if other_col = 'x' is a constant predicate:

CREATE INDEX ON view (id) WHERE other_col = 'x';

PostgreSQL does not use a partial index

The upcoming Postgres 9.6 would even allow an index-only scan on the partial index:

CREATE INDEX ON view (id, name) WHERE other_col = 'x';

You will love this improvement (quoting the /devel manual):

Allow using an index-only scan with a partial index when the index's predicate involves column(s) not stored in the index (Tomas Vondra, Kyotaro Horiguchi)

An index-only scan is now allowed if the query mentions such columns only in WHERE clauses that match the index predicate

Verify performance with EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Run a couple of times to rule out caching effects.

PostgreSQL: How to select on non-aggregating column?

5 Answers5