SELECT DISTINCT to return at most one row

Question

Given the following db structure:

Regions

id	name
1	EU
2	US
3	SEA

Customers:

id	name	region
1	peter	1
2	henry	1
3	john	2

There is also a PL/pgSQL function in place, defined as sendShipment() which takes (among other things) a sender and a receiver customer ID.

There is a business constraint around this which requires us to verify that both sender and receiver sit in the same region - and we need to do this as part of sendShipment(). So from within this function, we need to query the customer table for both the sender and receiver ID and verify that both their region ID is identical. We will also need to ID itself for further processing down the line.

So maybe something like this: SELECT DISTINCT region FROM customers WHERE id IN (?, ?)

The problem with this is that the result will be either an array (if the customers are not within the same region) or a single value.

Is there are more elegant way of solving this constraint? I was thinking of SELECT INTO and use a temporary table, or I could SELECT COUNT(DISTINCT region) and then do another SELECT for the actual value if the count is less than 2, but I'd like to avoid the performance hit if possible.

"I'd like to query the table for two rows at a time by ID" - _hold on_, why are you even doing that? Why not do a _single_ query with windowing-functions? that will be a lot faster — Dai, Mar 11 '22 at 00:02
What does your table actually represent? What is your end-goal? What happens if 3 or more rows share the same `name` value? What if those 3 or more rows have non-consecutive `id` values? — Dai, Mar 11 '22 at 00:03
@Dai The actual data is not too different from this, mostly UUIDS instead of string values. It doesn't matter how many of the rows present in the table share the same `name` value, you can assume there will be thousands with identical values. As part of a larger function, there needs to be a validation that both person A and person B share the same name. — ptye, Mar 11 '22 at 00:08
@Dai I updated my question, hopefully it's a bit clearer now what I'm trying to achieve. — ptye, Mar 11 '22 at 00:20

score 1 · Answer 1 · edited Mar 11 '22 at 06:09

There is also a PL/pgSQL function in place, defined as sendShipment() which takes (among other things) a sender and a receiver customer ID.

There is a business constraint around this which requires us to verify that both sender and receiver sit in the same region - and we need to do this as part of sendShipment(). So from within this function, we need to query the customer table for both the sender and receiver ID and verify that both their region ID is identical. We will also need to ID itself for further processing down the line.

This query should work:

WITH q AS (
    SELECT
        COUNT( * ) AS CountCustomers,
        COUNT( DISTINCT c.Region ) AS CountDistinctRegions,
--      MIN( c.Region ) AS MinRegion
        FIRST_VALUE( c.Region ) OVER ( ORDER BY c.Region ) AS MinRegion
    FROM
        Customers AS c
    WHERE
        c.CustomerId = $senderCustomerId
        OR
        c.CustomerId = $receiverCustomerId
)
SELECT
    CASE WHEN q.CountCustomers = 2 AND q.CountDistinctRegions = 2 THEN 'OK' ELSE 'BAD' END AS "Status",
    CASE WHEN q.CountDistinctRegions = 2 THEN q.MinRegion END AS SingleRegion
FROM
    q

The above query will always return a single row with 2 columns: Status and SingleRegion.
SQL doesn't have a "SINGLE( col )" aggregate function (i.e. a function that is NULL unless the aggregation group has a single row), but we can abuse MIN (or MAX) with a CASE WHEN COUNT() in a CTE or derived-table as an equivalent operation.
- ~~Alternatively, windowing-functions could be used, but annoyingly they don't work in GROUP BY queries despite being so similar, argh~~.
  - Once again, this is the ISO SQL committee's fault, not PostgreSQL's.
As your Region column is UUID you cannot use it with MIN, but I understand it should work with FIRST_VALUE( c.Region ) OVER ( ORDER BY c.Region ) AS MinRegion.
As for the columns:
- The Status column is either 'OK' or 'BAD' based on those business-constraints you mentioned. You might want to change it to a bit column instead of a textual one, though.
- The SingleRegion column will be NOT NULL (with a valid region) if CountDistinctRegions = 2 regardless of CountCustomers, but feel free to change that, just-in-case you still want that info.

Interesting. I think this would probably solve the trouble I've been having. Another approach I had just thought of was to use a regular `select distinct` and feed it directly into a for loop as described here: https://www.postgresql.org/docs/current/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING - then simply check if we iterate more than once. It's not pretty, I expect your query will perform better. — ptye, Mar 11 '22 at 00:39
One more thing - `MIN` and `MAX` won't work in case the IDs are of type `UUID`. Which I'm afraid they are. I expect `FIRST_VALUE` might be of some help though . — ptye, Mar 11 '22 at 00:49
Thank you for the update. I will mark the question as answered. In the mean time, I think I found another way to do it, like this `SELECT r.region from customers s inner join customers r on s.region = r.region where s.id = 'sender_id' and r.id = 'receiver_id';` - problem is it requires an inner join, I'm not sure how bad the performance impact would be with this given a large table. — ptye, Mar 11 '22 at 01:03
@ptye The size of the table is irrelevant: just ensure you have at least 1 index covering the columns used in the query, e.g. `CREATE INDEX IX_regions ON customers ( id, region )` (the order of the columns is important, so also try `( region, id )` (reversed order) if you get a poor `EXPLAIN` result. — Dai, Mar 11 '22 at 01:20
@ptye I'll admit that I do admire the simplicity of your query though. I honestly didn't think of doing anything like that. — Dai, Mar 11 '22 at 01:26
@ptye You really should post your query as its own answer, un-accept mine, and accept that instead, because it is a different solution. — Dai, Mar 11 '22 at 01:27
Looks like I'm unable to mark my own answer for another 2 days - but thanks again! Really appreciate your time and effort. — ptye, Mar 11 '22 at 09:23

score 1 · Answer 2 · answered Mar 11 '22 at 09:22

For anybody else who's interested in a simple solution, I finally came up with the (kind of obvious) way to do it:

SELECT
    r.region
FROM
    customers s
INNER JOIN customers r ON 
    s.region = r.region 
WHERE s.id = 'sender_id' and r.id = 'receiver_id';

Huge credit to SELECT DISTINCT to return at most one row who helped me out a lot on this and also posted a viable solution.

SELECT DISTINCT to return at most one row

2 Answers2