I am trying to write a query which returns an arbitrary sized representative sample of data. I would like to do this by only selecting nth rows where n is such that the entire result set is as close as possible to an arbitrary size.
I want this to work in cases where the result set would normally be less than the arbitrary size. In such a case, the entire result set should be returned.
I found this question which shows how to select every nth row.
Here is what I have so far:
SELECT * FROM (
SELECT *, ((row_number() OVER (ORDER BY "time"))
% ceil(count(*)::FLOAT / 500::FLOAT)::BIGINT) AS rn
FROM data_raw) sa
WHERE sa.rn=0;
This results in the following error:
ERROR: column "data_raw.serial" must appear in the GROUP BY clause or be used in an aggregate function Position: 23
Removing the calculation for n
like this works:
SELECT * FROM (
SELECT *, (row_number() OVER (ORDER BY "time"))
% 50 AS rn FROM data_raw) sa
LIMIT 500;
I also tried moving the calculation to the WHERE clause:
SELECT * FROM (
SELECT *, (row_number() OVER (ORDER BY "time")) AS rn
FROM data_raw) sa
WHERE (sa.rn % ceil(count(*)::FLOAT / 500::FLOAT)::BIGINT)=0;
That too results in an error:
ERROR: aggregate functions are not allowed in WHERE Position: 108
Does anyone have any ideas on either how to fix my query or a better way to do this?
I have also thought about using random numbers and probability to select rows, but I would rather do something deterministic without the possibility of clumping.