SELECT fixed number of rows by evenly skipping rows

Question

I am trying to write a query that returns a representative sample of data of an arbitrary size. I would like to do this by selecting only every n-th row, where n is chosen so that the result set is as close as possible to the target size.

This should also work when the full result set is smaller than the target size. In that case, the entire result set should be returned.

I found this question which shows how to select every n-th row.

Here is what I have so far:

SELECT * FROM (
   SELECT *, ((row_number() OVER (ORDER BY "time"))
               % ceil(count(*)::FLOAT / 500::FLOAT)::BIGINT) AS rn
   FROM data_raw) sa
WHERE sa.rn=0;

This results in the following error:

ERROR: column "data_raw.serial" must appear in the GROUP BY clause or be used in an aggregate function Position: 23

Removing the calculation for n like this works:

SELECT * FROM (
   SELECT *, (row_number() OVER (ORDER BY "time"))
              % 50 AS rn FROM data_raw) sa
LIMIT 500;

I also tried moving the calculation to the WHERE clause:

SELECT * FROM (
   SELECT *, (row_number() OVER (ORDER BY "time")) AS rn
   FROM data_raw) sa
WHERE (sa.rn % ceil(count(*)::FLOAT / 500::FLOAT)::BIGINT)=0;

That too results in an error:

ERROR: aggregate functions are not allowed in WHERE Position: 108

Does anyone have any ideas on either how to fix my query or a better way to do this?

I have also thought about using random numbers and probability to select rows, but I would rather do something deterministic without the possibility of clumping.

BTW, pg 9.4? You are using the beta-release? – Erwin Brandstetter, Nov 19 '14 at 12:13

score 1 · Answer 1 · answered Nov 19 '14 at 09:37

You should make that calculation a subquery:

WHERE rn % (SELECT CEIL(COUNT(*)::FLOAT / 500::FLOAT)::BIGINT FROM data_raw) = 0

This way, it is no longer seen as an aggregate function, but as a scalar query.
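
Put together with the third attempt from the question, the whole query might look like the following. This is a sketch rather than something taken from the answer itself: it assumes the data_raw table and "time" column from the question and a target of 500 rows, and it writes the divisor as 500.0 instead of casting to FLOAT, which is only a stylistic choice.

```sql
-- Sketch only: data_raw and "time" come from the question; 500 is the target size.
SELECT *
FROM  (
   SELECT *, row_number() OVER (ORDER BY "time") AS rn   -- number rows in time order
   FROM   data_raw
   ) sa
WHERE  sa.rn % (
          -- stride n = ceil(total rows / 500), computed by a scalar subquery
          SELECT ceil(count(*) / 500.0)::bigint
          FROM   data_raw
       ) = 0;
```

The subquery does not refer to the outer row, so it should only need to be evaluated once, at the cost of a second pass over data_raw.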

Alexander

score 1 · Accepted Answer · edited May 23 '17 at 11:57

The mistake in your first attempt is that you can't mix the aggregate function count(*) with the un-aggregated selection of rows. You can fix this by using count(*) as a window aggregate function instead:

SELECT * FROM (
   SELECT *, ((row_number() OVER (ORDER BY "time"))
               % ceil(count(*) OVER () / 500.0)::int) AS rn
   FROM   data_raw
   ) sub
WHERE sub.rn = 0;
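
To see what sample sizes this stride produces, the arithmetic can be checked without touching data_raw. The table sizes below are made-up values for illustration; rows_kept is simply the number of row numbers from 1 to total that are divisible by n.

```sql
-- Hypothetical table sizes; 500 is the target sample size from the question.
SELECT total,
       ceil(total / 500.0)::int         AS n,          -- stride used in the query above
       total / ceil(total / 500.0)::int AS rows_kept   -- integer division = floor(total / n)
FROM  (VALUES (120), (500), (501), (1234), (100000)) AS t(total);

--  total  |  n  | rows_kept
-- --------+-----+-----------
--     120 |   1 |       120
--     500 |   1 |       500
--     501 |   2 |       250
--    1234 |   3 |       411
--  100000 | 200 |       500
```

For tables no larger than the target, n is 1 and every row is returned, as the question asks. Just above the target the stride jumps to 2, so the sample can drop to about half the target before it grows back toward 500.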

Detailed explanation here:

Best way to get result count before LIMIT was applied

@Alexander has a fix for your last attempt.
