Selecting randomly-sized, random subsets of rows

Question

I'm following this question on extracting a random subset of rows.

My data look like:

scenario   urban_areas_simple       place      population
North       Primary Urban Areas     Leeds      700,000
South       Primary Urban Areas     London     9,000,000
Scotland    Rural                   Shetland   22,000
...         ...                     ...        ...

Using dplyr, I have the following code, which works and randomly selects 4 rows based on conditions on my scenario and urban_areas_simple columns:

filter(lads, 
    scenario == "north" & urban_areas_simple == "Primary Urban Areas") %>% 
    sample_n(4)

However, I also want to randomise the number of rows selected; I only chose 4 arbitrarily as an example.

How would I randomly select rows meeting these conditions, for subsets of a random size?

NB: there may only be between 10 and 50 rows meeting each condition.

@ThirstforKnowledge what exactly happens when you attempt Robin's solution? It works fine for me. — Eumenedies, Oct 27 '17 at 09:54
Actually, that was my mistake, it's now working with Robin's solution. Do you want to post a proper answer @RobinGertenbach? — Thirst for Knowledge, Oct 27 '17 at 09:57
Done. I didn't think it was worth adding to Roman's answer but forgot about the grouping benefit. — Robin Gertenbach, Oct 27 '17 at 10:42

score 1 · Answer 1 · answered Oct 27 '17 at 09:38

Instead of 4, you can use sample(1:100, size = 1), which picks a random integer between 1 and 100. To make the process reproducible, call set.seed(x), where x is any integer, before using any function that depends on the random number generator.
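A minimal sketch of this idea, with the upper bound taken from the number of rows rather than a fixed 100 so the requested size can never exceed what is available. The `pua` data frame is a hypothetical stand-in for the filtered `lads` rows, and the sketch assumes at least one row matches.

```r
library(dplyr)

set.seed(42)  # any integer; makes the draw reproducible

# Hypothetical stand-in for the rows of `lads` that pass the filter
pua <- data.frame(
  place      = paste0("place_", 1:12),
  population = seq(10000, 120000, by = 10000)
)

# Draw the sample size from 1..nrow(pua), so it is always a valid size
n_rows <- sample(seq_len(nrow(pua)), size = 1)

# Sample that many rows without replacement
picked <- pua %>% sample_n(n_rows)
```

Using seq_len(nrow(pua)) instead of 1:100 avoids the error sample_n raises when asked for more rows than exist.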

Roman Luštrik

This worked, and yes, I've already set the seed; I just forgot to put it in the question. Is there any way to avoid/catch/handle the error when the random number is larger than the number of observations in the data? – Thirst for Knowledge Oct 27 '17 at 09:46
@ThirstforKnowledge, that's why Robin's comment is quite nice. You could use `mtcars %>% sample_n(sample(1:nrow(.), 1))`, but it won't deal with grouped data.frames correctly. – Axeman Oct 27 '17 at 09:51
@ThirstforKnowledge you can use `tryCatch`. It may not be the most appropriate solution in this case, but it's very versatile. You could generate the vector of possible values with `1:nrow(x)`. – Roman Luštrik Oct 28 '17 at 08:52

score 0 · Accepted Answer · answered Oct 27 '17 at 10:41

filter(lads, 
  scenario == "north" & urban_areas_simple == "Primary Urban Areas") %>% 
  sample_frac(runif(1))

does just that.

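Because runif(1) returns a fraction between 0 and 1, the requested sample can never be larger than the data, so no error is thrown. sample_frac can also handle stratified sampling from a grouped data frame with unequal group sizes.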
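As a rough illustration of the grouped case, the sketch below applies one random fraction to two groups of unequal size. The `lads_toy` data frame and its values are hypothetical, and the fraction is drawn once up front so that both groups use the same value.

```r
library(dplyr)

set.seed(1)

# Hypothetical data: two scenarios with unequal numbers of rows
lads_toy <- data.frame(
  scenario = rep(c("north", "south"), times = c(10, 30)),
  place    = paste0("place_", 1:40)
)

# One random fraction between 0 and 1, reused for every group
frac <- runif(1)

# Sample that fraction of rows within each scenario
sampled <- lads_toy %>%
  group_by(scenario) %>%
  sample_frac(frac) %>%
  ungroup()

# Rows drawn per group; each count is at most that group's size
count(sampled, scenario)
```

As far as I understand, the per-group row count is rounded, so a very small fraction can yield zero rows for a small group; check the result if an empty sample would be a problem.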
