I have a sql table like below
So the id
column contains the unique elements a
, b
, and c
. I would like to choose at random (uniformly) 2 out of a
, b
, and c
and then take all rows whose id
is either one of these 2 elements. For example, if a
and b
are chosen. Then my desired result is
I have read this thread, but it is about subsample uniformly on the set of all rows. On the other hand, my subsample is extracted from the set of unique elements of a column
id
and then take all rows whoseid
belongs to this subsample.My work is on Sagemaker of AWS, so I can extract the whole dataframe and perform subsample and extraction by Python. However, I guess this is slower than performing subsample directly by sql.
Could you elaborate on how to efficiently subsample in this case?