I have table_1, that has data such as:
Range Start Range End Frequency
10 20 90
20 30 68
30 40 314
40 40 191 (here, it means we have just 40 as data point repeating 191 times)
table_2:
group value
10 56.1
10 88.3
20 53
20 20
30 55
I need to get the stratified sample on the basis of range from table_1, the table_2 can have millions of rows but the result should be restricted to just 10k points.
Tried below query:
SELECT
d.*
FROM
(
SELECT
ROW_NUMBER() OVER(
PARTITION BY group
ORDER BY group
) AS seqnum,
COUNT(*) OVER() AS ct,
COUNT(*) OVER(PARTITION BY group) AS cpt,
group, value
FROM
table_2 d
) d
WHERE
seqnum < 10000 * ( cpt * 1.0 / ct )
but a bit confused with the analytics functions usage here.
Expecting 10k records as a stratified sample from table_2:
Result table:
group value
10 56.1
20 53
20 20
30 55