
Short version: I have a SQL statement where I only want the results if the number of rows returned is less than some value (say 1000); otherwise I want a different result set. What's the best way to do this without incurring the overhead of returning those 1000 rows (as would happen if I used LIMIT) when I'm just going to throw them away?

For instance, I want to return the results of

SELECT * 
FROM T 
WHERE updated_at > timestamp 
  AND name <= 'Michael' 
ORDER BY name ASC

provided there are at most 1000 entries, but if there are more than that I want to return

SELECT * 
FROM T 
ORDER BY name ASC 
LIMIT 25

Two queries isn't bad, but I definitely don't want to get 1000 records back from the first query only to toss them.

(I'm happy to use Postgres extensions too, but I prefer plain SQL.)

--

To explain: I'm refreshing data requested by the client in batches, and sometimes the client needs to know if there have been any changes in the part it has already received. If there are too many changes, however, I just give up and start sending the records from the beginning again.

  • 1000 rows is not a lot, so store your first query result in a temp table. Then you can check the row count and simply use an `if` statement to return the data from the temp table or run the other query; you could also just store the primary keys in the temp table and use it to join back to `T` to return the required columns (see the sketch after these comments). – Stu May 22 '22 at 18:13
  • Could you explain why I would want to keep those rows in a temp table if I don't care about those rows when there are more than 1000 of them? It's not like the query I want to run instead in those cases cares about the result of the first query. I just want to toss it (too complex to manage batching deltas to the client for a batched result). Maybe 1000 isn't a lot, but the client is insecure so I have to be ready for malicious requests of arbitrary size, and each row can have a fair bit of data in the real table. – Peter Gerdes May 22 '22 at 19:00
  • I mean, I could always just run a count first (in the same transaction) and switch on the result of that, but it seems like a waste to execute the query twice, once to count and once to return the rows if the count is small. But maybe that's the best option. – Peter Gerdes May 22 '22 at 19:04
  • 1
    you can't know in advance how many rows will be returned so you either have to count first and then if required re-run the query to return results, or run it once, store the list of PKs then if you do need to run it again you can just use the already found rows to return the data which should be faster. alternatively, *could* you maintain meta data and store the actoual count of matching rows and maintain the meta data for inserts/updates/deletes? then you only have to query the meta data to know what query to run,,, – Stu May 22 '22 at 19:10
  • Thanks, that's what I needed. FYI, in this case it's a different query, but I see your point. Unfortunately no, because the server is stateless, so only the client should know when it last requested a refresh of the data it has fetched so far. If I run into a performance issue with counting, then instead of only storing the last time the client refreshed everything I'll keep a timestamp for each chunk of rows (names Adams to Bob, names Bart to Bashir, etc.) so I don't have to either start from the beginning or provide a delta for all rows fetched so far (then the client will be able to handle a delta which only covers a subset). – Peter Gerdes May 22 '22 at 20:15
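
A minimal PL/pgSQL sketch of the temp-table approach suggested in the comments: the function name fetch_changes and its since parameter are made up for illustration, while T, updated_at and name come from the question. It materializes at most 1001 matching rows, checks the count, and returns whichever result set applies.

CREATE OR REPLACE FUNCTION fetch_changes(since timestamptz)  -- hypothetical name and parameter
RETURNS SETOF T
LANGUAGE plpgsql AS
$$
DECLARE
   n bigint;
BEGIN
   DROP TABLE IF EXISTS tmp_changes;         -- in case of a repeat call in the same transaction
   CREATE TEMP TABLE tmp_changes ON COMMIT DROP AS
   SELECT *
   FROM   T
   WHERE  updated_at > since
   AND    name <= 'Michael'
   ORDER  BY name
   LIMIT  1001;                              -- never materialize more than 1001 rows

   SELECT count(*) INTO n FROM tmp_changes;  -- cheap: at most 1001 rows to count

   IF n < 1001 THEN
      RETURN QUERY SELECT * FROM tmp_changes ORDER BY name;
   ELSE
      RETURN QUERY SELECT * FROM T ORDER BY name LIMIT 25;
   END IF;
END
$$;

Called as, for example, SELECT * FROM fetch_changes('2022-05-01 00:00'); the trade-off versus a single SQL statement is the extra write of the temp table.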

1 Answer

WITH max1000 AS (
   SELECT the_row, count(*) OVER () AS total
   FROM  (
      SELECT the_row  -- named row type
      FROM   T AS the_row
      WHERE  updated_at > timestamp
      AND    name <= 'Michael'
      ORDER  BY name
      LIMIT  1001
      ) sub
   )
SELECT (the_row).*  -- parentheses required
FROM   max1000 m
WHERE  total < 1001

UNION ALL
(  -- parentheses required
SELECT * 
FROM   T 
WHERE (SELECT total > 1000 FROM max1000 LIMIT 1)
ORDER  BY name
LIMIT  25
)

The subquery sub in the CTE max1000 gets the complete, sorted result for the first query, wrapped as a row type, with LIMIT 1001 to avoid excess work.

The outer SELECT adds the total row count with the window function count(*) OVER ().

The first SELECT of the outer UNION query returns the decomposed rows as the result, if there are fewer than 1001 of them.

The second SELECT of the outer UNION query returns the alternate result, if there were more than 1000. Parentheses are required so that the ORDER BY and LIMIT apply only to that leg of the UNION rather than to the combined result.

Or:

WITH max1000 AS (
   SELECT *
   FROM   T
   WHERE  updated_at > timestamp
   AND    name <= 'Michael'
   ORDER  BY name
   LIMIT  1001
   )
, ct(ok) AS (SELECT count(*) < 1001 FROM max1000)   

SELECT *
FROM   max1000 m
WHERE (SELECT ok FROM ct)

UNION ALL
(  -- parentheses required
SELECT * 
FROM   T 
WHERE (SELECT NOT ok FROM ct)
ORDER  BY name
LIMIT  25
);

I think I like the second one better. Not sure which is faster.

Either query optimizes performance for the case of fewer than 1001 rows in most calls. If that's the exception rather than the rule, I would first run a somewhat cheaper count. It also depends a lot on available indexes ...
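
For that case, a minimal sketch of such a capped count, reusing the filter from the question (timestamp stays as the question's placeholder); the inner LIMIT stops the scan after 1001 matching rows, and the caller switches on the result:

SELECT count(*) AS n
FROM  (
   SELECT 1
   FROM   T
   WHERE  updated_at > timestamp
   AND    name <= 'Michael'
   LIMIT  1001   -- enough to distinguish "at most 1000" from "more than 1000"
   ) sub;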

You get no row if the first query finds no row. (Seems like an odd result.)

Erwin Brandstetter
  • Thanks. Maybe I don't understand: why is it weird that you get no row if the first query (max1000?) finds no row? I mean, that's exactly what I want. I'm streaming data to the client, and this is the query that's run when the client says, "Hey, I've got the results up to Michael, but can you tell me if anything changed in those results so I can update my display?" But I might have misunderstood. – Peter Gerdes May 23 '22 at 10:57