1

I have two tables: parcel and structure, with a one-to-many relationship between them: structure.parcel_id points to parcel.id.

I want to select all the single structures. My current solution works, but is very grotesque:

SELECT 
max(column_1),
max(column_2),
max(column_3),
...
(twenty+ columns)

FROM structure
GROUP BY parcel_id
HAVING count(structure.id) = 1;

Because structure.id is non-nullable and the HAVING clause above, every group has, by definition, only one row in it. Unfortunately Postgres doesn't realize this, so if I say:

SELECT *    
FROM structure
GROUP BY parcel_id
HAVING count(structure.id) = 1;

Then I get the expected error about needing to use an aggregate function for the columns. I get around this with the arbitrary max() function, but this is confusing to someone else trying to understand the code, and it forces me to explicitly list all the columns, which means I have to dive back in and edit this code whenever a column is added. (Which, unfortunately, happens fairly frequently in my environment.)

I have this alternate solution, which solves most of my problems:

SELECT * FROM STRUCTURE
WHERE id IN (
    SELECT
        max(id) as id
    FROM structure
    GROUP by structure.parcel_id
    HAVING count(structure.id)  = 1
    );

But this is clearly adding unnecessary slowness to my query which I'd like to avoid because of the frequency of the query and the size of the table.

This question is very similar to what I'm asking, but it will grab the first row of every group, not the first (and only) row of singular groups.

Is there an elegant way to solve this problem?

Sample data per request:

structure table:

id | parcel_id | column_1 | column_2 | ...
------------------------------------------
1  |   536     |   ...    | ....     | ...
2  |   536     |   ...    | ....     | ...
3  |   537     |   ...    | ....     | ...
4  |   538     |   ...    | ....     | ...
5  |   538     |   ...    | ....     | ...
6  |   539     |   ...    | ....     | ...
7  |   540     |   ...    | ....     | ...
8  |   541     |   ...    | ....     | ...
9  |   541     |   ...    | ....     | ...

Desired result:

id | parcel_id | column_1 | column_2 | ...
------------------------------------------
3  |   537     |   ...    | ....     | ...
6  |   539     |   ...    | ....     | ...
7  |   540     |   ...    | ....     | ...

Note that 537, 539, and 540 are the only parcel_id's that don't repeat.

Both tables have ~1.5 million rows and ~25 columns.

Community
  • 1
  • 1
lnhubbell
  • 3,304
  • 5
  • 17
  • 22
  • Can you please post some representative sample data and the desired result. – PM 77-1 Mar 01 '16 at 02:45
  • So, the `structure` table has 1.5M rows. How many rows are in the `parcel` table? How many parcels have only one corresponding row in the `structure` table? In other words, how many rows the final query would return? – Vladimir Baranov Mar 01 '16 at 05:04
  • I m not well versed with SQL so posting my suggestion in comment. You can use self join as this 'SELECT S1.* FROM STRUCTURE S1, STRUCTURE S2 WHERE S1.parcel_id = S2.parcel_id GROUP BY S2.parcel_id HAVING count(S2.parcel_id) = 1' – Rajen Raiyarela Mar 01 '16 at 05:42

2 Answers2

1

How about using window functions?

SELECT s.*    
FROM (SELECT s.*, COUNT(*) OVER (PARTITION BY parcel_id) as cnt
      FROM structure s
     ) s
WHERE cnt = 1;

However, a more efficient method might be:

select s.*
from structure s
where not exists (select 1
                  from structure s2
                  where s2.parcel_id = s.parcel_id and s2.id<> s.id
                 );

In particular, this can take advantage of an index on structure(parcel_id, id).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Your second method is quite clever! I like it, but after some (albeit rather rough) benchmarking and query plan analyzing, it looks like it will be slower than my grotesque solution. I'm hoping for a solution that can at least match it in terms of speed. – lnhubbell Mar 01 '16 at 03:08
  • @inhubbell . . . Do you have the appropriate index? I would expect the second message to be pretty speedy with the index. – Gordon Linoff Mar 02 '16 at 00:38
1

This should be considerably faster:

SELECT s.*
FROM  (
   SELECT parcel_id
   FROM   structure
   GROUP  BY 1
   HAVING count(*) = 1
   ) s1
JOIN structure s USING (parcel_id);

All you need is an index on (parcel_id).

  • Since the query is restricted to unique parcel_id, there is no need to involve id in the subquery. So we can get an index-only scan out of a plain index on just (parcel_id) - and use the same index for the join.

  • The join should be a bit faster than IN with a big subselect. (Though they mostly result in the same query plan in modern Postgres.)

  • count(*) is a bit faster than count(<expression>), since only the existence of a row hast to be established.

Asides:

@Gordon's 2nd query with the NOT EXISTS anti-semi-join should be fast, too. You just need a multicolumn index on (parcel_id, id).

The question you linked to is for SQL Server. Here is a more relevant related question for Postgres:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228