72

This question asks about getting a random(ish) sample of records on SQL Server and the answer was to use TABLESAMPLE. Is there an equivalent in Oracle 10?

If there isn't, is there a standard way to get a random sample of results from a query set? For example how can one get 1,000 random rows from a query that will return millions normally?

Community
  • 1
  • 1
Jeremy French
  • 11,707
  • 6
  • 46
  • 71

9 Answers9

87
SELECT  *
FROM    (
        SELECT  *
        FROM    mytable
        ORDER BY
                dbms_random.value
        )
WHERE rownum <= 1000
Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • 19
    DBMS_RANDOM is PL/SQL, and there is no pure SQL way in Oracle to generate random numbers. All hail context switch. – Quassnoi Apr 09 '09 at 11:00
  • On a table with 40'000 rows, this query takes 0.1 second, when query based on `SAMPLE(n)` takes 0.02 second (5x faster). So for my amount of data, it's not that slow... – Julien Kronegg Jul 25 '19 at 05:19
78

The SAMPLE clause will give you a random sample percentage of all rows in a table.

For example, here we obtain 25% of the rows:

SELECT * FROM emp SAMPLE(25)

The following SQL (using one of the analytical functions) will give you a random sample of a specific number of each occurrence of a particular value (similar to a GROUP BY) in a table.

Here we sample 10 of each:

SELECT * FROM (
SELECT job, sal, ROW_NUMBER()
OVER (
PARTITION BY job ORDER BY job
) SampleCount FROM emp
)
WHERE SampleCount <= 10
Andrew Morton
  • 24,203
  • 9
  • 60
  • 84
grokster
  • 5,919
  • 1
  • 36
  • 22
  • 3
    We do not get 25 % of rows most of the time. Instead we get some uncertain number of rows. And every row from our original set has the same chance (¼ here) of being selected. – Mr. Tao Jan 26 '17 at 17:31
  • 3
    The `SEED(n)` option allows you to get always the same sample, hence the same number of rows (this may be useful when you need reproducible results), e.g. `SELECT * from emp SAMPLE(25) SEED(1)`. – Julien Kronegg Jul 25 '19 at 05:12
  • Using a table with 40'000 rows and `SAMPLE(25)` i've seen 24.41% to 25.36% of records being returned. – Julien Kronegg Jul 25 '19 at 05:15
17

This in not a perfect answer but will get much better performance.

SELECT  *
FROM    (
    SELECT  *
    FROM    mytable sample (0.01)
    ORDER BY
            dbms_random.value
    )
WHERE rownum <= 1000

Sample will give you a percent of your actual table, if you really wanted a 1000 rows you would need to adjust that number. More often I just need an arbitrary number of rows anyway so I don't limit my results. On my database with 2 million rows I get 2 seconds vs 60 seconds.

select * from mytable sample (0.01)
Arturo Hernandez
  • 2,749
  • 3
  • 28
  • 36
  • 3
    If the sample is to be statistically unbiased, this is not a good approach. Since the result of the inner query is ordered, the first values will have a bigger chance to be picked than the latter ones. Just imagine how unlikely it is for the last register of the table to be picked! – Baumann Aug 21 '18 at 14:33
  • Because the inner query orders by an un-indexed value. The db has to calculate the value all the way to the last row, before returning the first row. Because it is just as likely to be picked first as any other record. – Arturo Hernandez May 09 '19 at 03:41
  • 1
    @Baumann : `dbms_random.value` is a random value generated by the database, thus the rows should not considered to be _really_ ordered (it's only to make every row selected by `SAMPLE(n)` to have the same probability of being in the 1000 returned rows). – Julien Kronegg Jul 25 '19 at 05:36
  • The first query combines the best of two worlds (fast random selection with `SAMPLE(n)` and precisely defined number of returned results with `rownum<=1000`). – Julien Kronegg Jul 25 '19 at 05:38
  • In my case I wanted to draw on the existing production table to create some realistic test data. With a tweak to the sample size this does exactly what I need. – Wildcat Matt Aug 25 '20 at 14:34
8

Sample function is used for sample data in ORACLE. So you can try like this:-

SELECT * FROM TABLE_NAME SAMPLE(50);

Here 50 is the percentage of data contained by the table. So if you want 1000 rows from 100000. You can execute a query like:

SELECT * FROM TABLE_NAME SAMPLE(1);

Hope this can help you.

Onur A.
  • 3,007
  • 3
  • 22
  • 37
Ankit Bajpai
  • 13,128
  • 4
  • 25
  • 40
8
SELECT * FROM TABLE_NAME SAMPLE(1)

Will give you olny an approximate 1% share rather than exactly 1/100 of the number of observations. The likely reason is than Oracle generates a random flag for each observation as to whether include in in the sample that it generates. The argument 1 (1%) in such a generation process takes the role of probability of each observation's being selected into the sample.

If this is true, the actual distribution of sample sizes will be binomial.

Hors2force
  • 101
  • 1
  • 2
8

I know this has already been answered, but seeing so many visits here I'd like to add one version that uses the SAMPLE clause but still allows to filter the rows first:

with cte1 as (
    select *
    from t_your_table
    where your_column = 'ABC'
)
select * from cte1 sample (5)

Note however that the base select needs a ROWID column, which means it may not work for some views for example.

Thomas Tschernich
  • 1,264
  • 15
  • 29
0

Something like this should work:

SELECT * 
FROM table_name
WHERE primary_key IN (SELECT primary_key 
                      FROM
                      (
                        SELECT primary_key, SYS.DBMS_RANDOM.RANDOM 
                        FROM table_name 
                        ORDER BY 2
                      )
                      WHERE rownum <= 10 );
Taryn
  • 242,637
  • 56
  • 362
  • 405
0

We were given and assignment to select only two records from the list of agents..i.e 2 random records for each agent over the span of a week etc.... and below is what we got and it works

with summary as (
Select Dbms_Random.Random As Ran_Number,
             colmn1,
             colm2,
             colm3
             Row_Number() Over(Partition By col2 Order By Dbms_Random.Random) As Rank
    From table1, table2
 Where Table1.Id = Table2.Id
 Order By Dbms_Random.Random Asc)
Select tab1.col2,
             tab1.col4,
             tab1.col5,
    From Summary s
 Where s.Rank <= 2;
Jens Erat
  • 37,523
  • 16
  • 80
  • 96
0

Suppose you are trying to select exactly 1,000 random rows from a table called my_table. This is one way to do it:

select
    *
from
    (
        select
            row_number() over(order by dbms_random.value) as random_id,
            x.*
        from
            my_table x
    )
where
    random_id <= 1000
;

This is a slight deviation from the answer posted by @Quassnoi. They both have the same costs and execution times. The only difference is that you can select the random number used to fetch the sample.

Arturo Sbr
  • 5,567
  • 4
  • 38
  • 76