How to randomly subsample on a set of unique documents from a sql database?

Question

I have a sql table like below

So the id column contains the unique elements a, b, and c. I would like to choose at random (uniformly) 2 out of a, b, and c and then take all rows whose id is either one of these 2 elements. For example, if a and b are chosen. Then my desired result is

I have read this thread, but it is about subsample uniformly on the set of all rows. On the other hand, my subsample is extracted from the set of unique elements of a column id and then take all rows whose id belongs to this subsample.
My work is on Sagemaker of AWS, so I can extract the whole dataframe and perform subsample and extraction by Python. However, I guess this is slower than performing subsample directly by sql.

Could you elaborate on how to efficiently subsample in this case?

It does not make any sense to have duplicate IDs. Let's say the random function returns `ID='a'`, which row am I supposed to display? — Christophe, Nov 30 '21 at 10:36
@Christophe `id` is just a name of a column. The random function can return the row number to identify the chosen row. — Akira, Nov 30 '21 at 10:38

Slava Rozhnev · Accepted Answer · 2021-12-01T11:06:29.827

1

in MS SQL Server you can use next trick:

SELECT * FROM t WHERE id in (
    -- select 2 random id
    SELECT top 2 id FROM t GROUP by id ORDER BY NEWID()
);

MSSQL fiddle

I mean this query can be easy adopted to other RDBMS versions

edited Dec 01 '21 at 11:06

answered Nov 30 '21 at 10:47

Slava Rozhnev

9,510
6
23
39

Could you explain what is the purpose of `MIN(id)`? – Akira Nov 30 '21 at 20:01
`MIN` in this case can be changed to `MAX` it only served to get single value within group – Slava Rozhnev Nov 30 '21 at 21:44
I'm confused. Because we `GROUP by id`, each group has exactly 1 id. Is it redundant to use `MIN(id)`? – Akira Nov 30 '21 at 21:49
The inner query needs for return 2 random `id`, i can not use `distinct` with order by `newid()`, so i used `group by` and some aggregation for get distinct ids. May be you have another way to do this, i'll gad to know how :) – Slava Rozhnev Dec 01 '21 at 09:10
I meant if I remove `MIN(id)` from your code, it still works fine and returns 2 distinct random id's :) Please see [here](https://sqlize.online/sql/mssql2019/b434ba638959853408f8776e3157e347/). – Akira Dec 01 '21 at 09:23
Nice. Just updated my answer – Slava Rozhnev Dec 01 '21 at 11:06

Christophe · Answer 2 · 2021-12-01T09:25:35.907

1

You can use this :

SELECT rownum, id, col1
FROM (SELECT  row_number() OVER () rownum, id, col1
    FROM my_table 
    ORDER BY id) b
 WHERE rownum = round(random() * 4) + 1;

dbFiddle
random() returns a value between 0 and 1 so you need to multiply to a factor that gives a large enough range. Round it and ad 1 to have an integer and 1 as lowest value.

edited Dec 01 '21 at 09:25

answered Nov 30 '21 at 10:50

Christophe

651
4
22

How to randomly subsample on a set of unique documents from a sql database?

2 Answers2