
I have table_1, that has data such as:

Range Start   Range End   Frequency
10            20          90
20            30          68
30            40          314
40            40          191   (here, it means we have just 40 as a data point, repeating 191 times)

table_2:

group   value
10      56.1
10      88.3
20      53
20      20
30      55

I need a stratified sample from table_2, based on the ranges in table_1. table_2 can have millions of rows, but the result should be restricted to just 10k points.

I tried the query below:

SELECT
    d.*
FROM
    (
        SELECT
            ROW_NUMBER() OVER (
                                PARTITION BY group
                                ORDER BY group
                              ) AS seqnum,
            COUNT(*) OVER () AS ct,
            COUNT(*) OVER (PARTITION BY group) AS cpt,
            group, value
        FROM
            table_2 d
    ) d
WHERE
    seqnum < 10000 * ( cpt * 1.0 / ct )

but I am a bit confused about the usage of the analytic functions here.

I am expecting 10k records as a stratified sample from table_2:

Result table:

group   value
10      56.1
20      53
20      20
30      55
  • Why there is only one record of group 10 in result table? – Popeye Aug 22 '19 at 08:00
  • the result table is just for an example. Basically, the result table should contain the subset(stratified sample) of records from table_2 – Surendra Singh Aug 22 '19 at 08:03
  • Are the 10000 results supposed to be proportional based on the frequencies, so there are ~5 times as many values from the 30-40 range as there are from the 20-30 range (i.e. 314/68 = 4.617... times as many)? Also does a row with group 20 get counted in the 10-20 range, or the 20-30 range, or both? (One might assume that you include values >= start and < end; but then 40-40 would be empty...) – Alex Poole Aug 22 '19 at 09:23
  • @SurendraSingh . . . Your question doesn't make sense. Your first table has a total frequency of 663, but you are asking for 10,000 rows. Where do the remaining 9,337 rows come from? – Gordon Linoff Aug 22 '19 at 11:03
  • Why don't you use `sample clause`: https://docs.oracle.com/cd/E11882_01/server.112/e41084/statements_10002.htm#i2065953 ? – Dr Y Wit Aug 22 '19 at 16:18

2 Answers


If you need at least one record from each group, plus more records on a random basis, then try this:

SELECT GROUP, VALUE
FROM (SELECT T2.GROUP, T2.VALUE,
             ROW_NUMBER()
               OVER (PARTITION BY T2.GROUP ORDER BY NULL) AS RN
      FROM TABLE_1 T1
      JOIN TABLE_2 T2
        ON (T1.RANGE_START = T2.GROUP))
WHERE RN = 1
   OR CASE WHEN RN > 1
            AND RN = CEIL(DBMS_RANDOM.VALUE(1, RN))
           THEN 1 END = 1
FETCH FIRST 10000 ROWS ONLY;

Here, a row number is assigned within each group, and the result keeps row number 1 for every group, plus any other rows that satisfy the random condition.
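To make the idea concrete, here is a rough Python sketch (mine, not part of the answer) of the same logic: always keep the first row of each group, and keep later rows only when a random check passes, stopping once the cap is reached. The function name and the `randint` check are my approximation of the `CEIL(DBMS_RANDOM.VALUE(1, RN))` condition.

```python
import random
from itertools import groupby

def thin_sample(rows, cap=10000, seed=None):
    """rows: iterable of (group, value) tuples, assumed sorted by group."""
    rng = random.Random(seed)
    out = []
    for grp, members in groupby(rows, key=lambda r: r[0]):
        for rn, row in enumerate(members, start=1):
            # rn == 1 mirrors "WHERE RN = 1"; the random check roughly
            # mirrors RN = CEIL(DBMS_RANDOM.VALUE(1, RN)).
            if rn == 1 or rng.randint(1, rn) == rn:
                out.append(row)
            if len(out) >= cap:
                return out
    return out

rows = [(10, 56.1), (10, 88.3), (20, 53), (20, 20), (30, 55)]
sample = thin_sample(rows, cap=3, seed=1)
```

Note that, like the SQL above, this guarantees one row per group only as long as the cap is not hit before a group is reached.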

Cheers!!

Popeye

If I understand what you want - which is by no means certain - then I think you want to get a maximum of 10000 rows, with the number of group values proportional to the frequencies. So you can get the number of rows you want from each range with:

select range_start, range_end, frequency,
  frequency/sum(frequency) over () as proportion,
  floor(10000 * frequency/sum(frequency) over ()) as limit
from table_1;

RANGE_START  RANGE_END  FREQUENCY PROPORTION      LIMIT
----------- ---------- ---------- ---------- ----------
         10         20         90 .135746606       1357
         20         30         68 .102564103       1025
         30         40        314 .473604827       4736
         40         40        191 .288084465       2880

Those limits don't quite add up to 10000; you could go slightly above with ceil instead of floor.
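If you want the limits to sum to exactly 10000, one option (my suggestion, not in the original answer) is the largest-remainder method: floor each share as above, then hand the leftover rows to the ranges with the largest fractional parts. A sketch using the frequencies from table_1:

```python
from math import floor

frequencies = {(10, 20): 90, (20, 30): 68, (30, 40): 314, (40, 40): 191}
target = 10000
total = sum(frequencies.values())  # 663

# Floor each proportional share first, as the query above does...
exact = {rng: target * f / total for rng, f in frequencies.items()}
limits = {rng: floor(x) for rng, x in exact.items()}

# ...then give the leftover rows to the ranges whose fractional
# parts were largest (here 40-40 and 20-30 each gain one row).
leftover = target - sum(limits.values())
for rng in sorted(exact, key=lambda r: exact[r] - limits[r], reverse=True)[:leftover]:
    limits[rng] += 1

print(sum(limits.values()))  # 10000
```

The same allocation could be done in SQL with an extra `row_number()` over the fractional parts, but the arithmetic is easier to see here.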

You can then assign a nominal row number to each entry in table_2 based on which range it is in, and then restrict the number of rows from that range via that limit:

with cte1 (range_start, range_end, limit) as (
  select range_start, range_end, floor(10000 * frequency/sum(frequency) over ())
  from table_1
),
cte2 (grp, value, limit, rn) as (
  select t2.grp, t2.value, cte1.limit,
    row_number() over (partition by cte1.range_start order by t2.value) as rn
  from cte1
  join table_2 t2
  on (cte1.range_end > cte1.range_start and t2.grp >= cte1.range_start and t2.grp < cte1.range_end)
  or (cte1.range_end = cte1.range_start and t2.grp = cte1.range_start)
)
select grp, value
from cte2
where rn <= limit;

...

9998 rows selected.

I've used order by t2.value in the row_number() call because it isn't clear how you want to pick which rows in the range you actually want; you might want to order by dbms_random.value or something else.
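The overall two-step plan (per-range limit from the frequencies, then at most that many randomly chosen rows per range) can be sketched in Python like this; the function name and test data are mine, and the range-matching condition mirrors the join in the query above:

```python
import random
from math import floor

def stratified_sample(ranges, rows, target=10000, seed=None):
    """ranges: list of (start, end, frequency); rows: list of (group, value)."""
    rng = random.Random(seed)
    total = sum(f for _, _, f in ranges)
    sample = []
    for start, end, freq in ranges:
        limit = floor(target * freq / total)
        # Point ranges (start == end) match exactly; otherwise
        # start <= group < end, as in the join condition above.
        stratum = [r for r in rows
                   if (start == end and r[0] == start)
                   or (start < end and start <= r[0] < end)]
        rng.shuffle(stratum)  # the equivalent of ORDER BY dbms_random.value
        sample.extend(stratum[:limit])
    return sample
```

As with the SQL, a stratum with fewer rows than its limit simply contributes everything it has, so the result can fall short of the target.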

db<>fiddle with some artificial data.

Alex Poole