0

In a PostgreSQL database, I have an extract of cases worked in any given month by all of the case handlers. I need to randomly select 4 cases per case handler to complete a check. Is there any way to do this in SQL?

Assuming the database is called casemanagement and the table is called cases, I would expect the query to obtain the full extract without the random sample to be something along the lines of:

SELECT a.case_id, a.case_handler, a.case_status
FROM casemanagement.cases a
WHERE a.case_status = 'Completed'
ORDER BY a.case_id;

If I were to add in a condition for a random sample with a limit, I assume the following would work for an individual case handler:

SELECT a.case_id, a.case_handler, a.case_status
FROM casemanagement.cases a
WHERE a.case_status = 'Completed' and a.case_handler = 'John Smith'
ORDER BY RANDOM()
LIMIT 4

But is it possible to have the limit of 4 without having to specify the case handler in the where function?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • About: *"Assuming the database is called `casemanagement`"*. You are confusing "database" with "schema". See: https://www.postgresql.org/docs/current/ddl-schemas.html – Erwin Brandstetter Jul 20 '23 at 23:33
  • Yes apologies, I did mean schema, as I said I’m a very basic user so not too familiar with the terminology! – NiknotNathanoj Jul 21 '23 at 08:09

1 Answers1

1

Assuming you also have a table case_handler with one row per case handler of interest (like you should!)

Then a LATERAL subquery does the job:

SELECT h.case_handler, c.case_id, c.case_status
FROM   casemanagement.case_handler h
LEFT   JOIN LATERAL (
   SELECT c.case_id, c.case_status
   FROM   casemanagement.cases c
   WHERE  c.case_handler = h.case_handler
   AND    c.case_status = 'Completed'
   ORDER  BY random()
   LIMIT  4
   ) c ON true;

See:

Assuming case_handler.case_handler to be the PRIMARY KEY. Adapt to your actual design.

The index associated with the PK can also help performance, if an index-only scan is possible. More importantly: have an index on casemanagement.cases(case_handler). For large tables and/or few 'Completed' cases, more optimization is possible.

If you don't have a table case_handler, consider creating it.
Or distill it for the purpose of this query alone in a CTE:

WITH case_handler AS (
   SELECT DISTINCT case_handler FROM casemanagement.cases
   )
SELECT h.case_handler, c.case_id, c.case_status
FROM   case_handler h ... -- rest like above

For many rows per handler, there are faster query styles to get distinct rows. See:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228