We are attempting to pull a semi-random row from Oracle. (We don't need a perfectly random row that would meet rigorous statistical scrutiny, but we would like something that has a chance of returning any row in the table, even if there is some degree of skew.)

We are using this approach:

SELECT PERSON_ID FROM ENCOUNTER SAMPLE(0.0001) WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020 AND ROWNUM = 1

This approach appears to give us a single, seemingly random result each time we run it.

However, according to answers to another question we found, this approach returns rows from the beginning of the table far more commonly.

How much more commonly? If that statement is true, how strongly are results skewed toward the top of the table? Our typical table has tens of millions of rows (occasionally billions). Is there a simple heuristic or rough estimate for the skew in the distribution we can expect?

We are asking about the skew because other methods aren't fast enough for our use case. We avoid ORDER BY because the source tables can be so large (i.e., billions of rows) that the reporting server runs for hours or times out before we get an answer. Thus, our constraint is that we need approaches like SAMPLE that respond with little database overhead.

Praxiteles
  • What is meant by a semi-random row? – George Joseph Apr 19 '20 at 16:33
  • What is the definition of "top of the table"? And, in any case, do you really care what that distribution is, or should you instead ask how to get a more uniform distribution of your single, quasi-randomly generated row? I would think this latter question is what you really care about. – mathguy Apr 19 '20 at 16:39
  • @GeorgeJoseph added clarification to the question – Praxiteles Apr 19 '20 at 22:38
  • @mathguy If we can get a more uniform distribution - that would be great - but it needs to have little database overhead. Some of our tables have billions of rows - and ORDER BY will time out. That is why we are asking about the skew we can expect - because we know this query we show is fast and doesn't time out. – Praxiteles Apr 19 '20 at 22:39
  • Gordon Linoff showed you how to do it. `ORDER BY` is applied only AFTER you sample the table. Let's say you sample 2000 rows out of your billions, and those are uniformly distributed over the table; then you `ORDER BY` and select just one row, so you are only ordering 2000 rows, not billions. Now, if even that is too much, there is an even more efficient solution; I will put it in an answer just for fun, but Gordon's solution should be good enough. – mathguy Apr 19 '20 at 22:56

3 Answers

The issue is that SAMPLE basically goes through the table in order, randomly selecting rows as it scans. The problem is the ROWNUM, not the SAMPLE.

The solution is to use SAMPLE and then sort randomly:

SELECT p.*
FROM (SELECT PERSON_ID
      FROM ENCOUNTER SAMPLE(0.0001)              -- sampling happens first, on the base table
      WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020
      ORDER BY dbms_random.value                 -- only the sampled rows are shuffled
     ) p
WHERE ROWNUM = 1                                 -- then keep one row from the shuffled sample
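
On Oracle 12c and later, the same idea can be written with the row-limiting clause instead of the ROWNUM wrapper. This is a minimal sketch of that variant, assuming the same ENCOUNTER table and columns as above:

SELECT PERSON_ID
FROM ENCOUNTER SAMPLE(0.0001)
WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020
ORDER BY dbms_random.value   -- shuffle only the sampled rows
FETCH FIRST 1 ROW ONLY;      -- 12c row-limiting clause replaces the ROWNUM wrapper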
Gordon Linoff
  • Regretfully, we are unable to use ORDER BY due to the size of the tables (millions to billions of rows). We need something with low database overhead. All the answers in the other question we linked used methods like this one that have too much overhead. That is why the focus of this question is on the approach we described, and on understanding the skew of the results. – Praxiteles Apr 19 '20 at 22:37
  • @Praxiteles - did you test what you wrote above? I believe sampling happens first, before ORDER BY is applied just to the sample; you seem to believe that ordering is done on the entire table and only then is the sampling done. Why do you believe that? – mathguy Apr 19 '20 at 22:58
  • You are right. I looked at your code too quickly, and it had the form of other answers on the other question that used ORDER BY in the inner query against every row. I better understand your solution now. Thanks! – Praxiteles Apr 20 '20 at 05:09

Just for fun, here is an alternative way to select a single, uniformly distributed row out of a (uniformly distributed) "small" sample of rows from the table.

Suppose the table has millions or billions of rows, and we use the sample clause to select only a small, random (and presumably uniformly distributed) sample of rows. Let's say the sample size is 200 rows. How can we select a single row out of those 200, in such a way that the selection is not biased?

As the OP explained, if we always select the first row generated in the sample, the result is very likely to be biased. Gordon Linoff has shown a perfectly valid way to fix that. Here I describe a different approach, which is even more efficient: it generates only a single random number, and it does not need to order the 200 rows. (Admittedly this is not a lot of overhead, but it may still matter if the query must be run many times.)

Namely: given any 200 rows, generate a (hopefully uniformly distributed) single integer between 1 and 200. Also, as the 200 rows are generated, capture ROWNUM at the same time. Then it's as simple as selecting the row where ROWNUM = <the randomly generated integer>.

Unfortunately, the sample clause doesn't generate a fixed number of rows, even if the table and the percentage sampled are fixed (and even if stats on the table are current). So the solution is just slightly more complicated - first I generate the sample, then I count how many rows it contains, and then I select the one row we want.

The output will include a column for the "random row number"; if that is an issue, just list the columns from the base table instead of * in the final query. I assume the name of the base table is t.

with
  p as ( select t.*, rownum as rn    -- capture a row number as each sampled row is generated
         from   t   sample(0.0001)
       )
, r as ( select trunc(dbms_random.value(1, (select count(*) from p) + 1)) as rn
         from   dual                 -- one random integer between 1 and the sample size
       )
select p.*
from   p join r on p.rn = r.rn       -- keep only the row whose number was drawn
;
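
One subtlety worth noting: p is referenced twice (once to count its rows, once in the final join), and Oracle may either materialize the WITH subquery or evaluate it per reference; in the latter case the two references could see different samples. Here is a sketch that forces a single evaluation using the widely used, though undocumented, MATERIALIZE hint:

with
  p as ( select /*+ materialize */ t.*, rownum as rn   -- evaluate the sample exactly once
         from   t   sample(0.0001)
       )
, r as ( select trunc(dbms_random.value(1, (select count(*) from p) + 1)) as rn
         from   dual
       )
select p.*
from   p join r on p.rn = r.rn
;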

It's not accurate to say "[SAMPLE] gives results from the beginning of the table far more commonly," unless you're using SAMPLE incorrectly. However, there are some unusual cases where earlier rows are favored, if those early rows are much larger than subsequent rows.

SAMPLE Isn't That Bad

If you use a large sample size, the first rows returned do appear to come from the "first" rows of the table. (But tables are unordered, and while I observe this behavior on my machine there is no guarantee you will always see it.)

The below query does seem to do a good job of picking random rows, but not if you only look at the first N rows returned:

select * from test1 sample(99);

SAMPLE Isn't Perfect Either

The test case below shows how row size can skew the results. If you insert 10,000 large rows and then insert 10,000 small rows, a small SAMPLE will almost always return only large rows.

--drop table test1 purge;
create table test1(a varchar2(5), b varchar2(4000));

--Insert 10K large records.
insert into test1 select 'large', lpad('A', 4000, 'A') from dual connect by level <= 10000;

--Insert 10K small records.
insert into test1 select 'small', null from dual connect by level <= 10000;

--Select about 10 rows. Notice that they are almost always a "LARGE" row.
select * from test1 sample (0.1);

However, the skew completely disappears if you insert the small rows before the large rows.
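
For completeness, here is the reversed test as a sketch (the same table definition as above, with only the insert order swapped); per the observation above, the same 0.1% sample should now return a mix of both row types:

--Rebuild the table with the small rows physically first.
drop table test1 purge;
create table test1(a varchar2(5), b varchar2(4000));
insert into test1 select 'small', null from dual connect by level <= 10000;
insert into test1 select 'large', lpad('A', 4000, 'A') from dual connect by level <= 10000;

--The same sample size now returns both "SMALL" and "LARGE" rows.
select * from test1 sample (0.1);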

I think these results imply that SAMPLE is based on the distribution of data in blocks (8 KB of data), and is not strictly random per row. If small rows are "hidden" in a physically small part of the table, they are much less likely to show up. However, Oracle always seems to check the first part of the table, and if the small rows exist there, then the sample is evenly distributed. The rows have to be hiding very well to be missed.
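
One way to probe this is Oracle's explicit block-sampling mode: SAMPLE BLOCK picks whole blocks rather than individual rows. Comparing it to plain row sampling against the test1 table above is a quick experiment (a sketch; exact counts will vary from run to run):

--Row sampling: each row has roughly the same chance of being picked.
select a, count(*) from test1 sample (1) group by a;

--Block sampling: whole blocks are chosen, so the many-rows-per-block "small"
--rows tend to show up in bursts or not at all.
select a, count(*) from test1 sample block (1) group by a;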

The real answer depends on Oracle's implementation, which I don't have access to. Hopefully this test case will at least give you some ideas to play around with and determine whether SAMPLE is random enough for your needs.

Jon Heller