1

I'm following this question on extracting a random subset of rows.

My data look like:

scenario   urban_areas_simple       place      population
North       Primary Urban Areas     Leeds      700,000
South       Primary Urban Areas     London     9,000,000
Scotland    Rural                   Shetland   22,000
...         ...                     ...        ...

Using dplyr I have the following code, which works and randomly selects 4 rows based on conditions in my scenario and urban_areas_simple columns:

filter(lads, 
    scenario == "north" & urban_areas_simple == "Primary Urban Areas") %>% 
    sample_n(4) 

However, I also want to randomise the number of rows selected, as the 4 here is only an arbitrary example.

How would I randomly select rows meeting these conditions, for subsets of a random size?

NB: there may be only between 10 and 50 rows meeting each condition.

Thirst for Knowledge

2 Answers

1

Instead of 4, you can use sample(1:100, size = 1). This will pick a random number between 1 and 100. If you want to make the process reproducible, stick a set.seed(x) before you start using any function which depends on a random seed. x is any integer.

Roman Luštrik
  • This worked and yes, I've already set the seed, I just forgot to put it in the question. Is there any way to avoid/catch/handle the error when the random number is larger than the number of observations in the data? – Thirst for Knowledge Oct 27 '17 at 09:46
  • @ThirstforKnowledge, that's why Robin's comment is quite nice. You could use `mtcars %>% sample_n(sample(1:nrow(.), 1))`, but it won't deal with grouped data.frames correctly. – Axeman Oct 27 '17 at 09:51
  • @ThirstforKnowledge you can use `tryCatch`. It may not be the most appropriate solution in this case, but it's very versatile. You could generate the vector of possible values with `1:nrow(x)`. – Roman Luštrik Oct 28 '17 at 08:52
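
The two suggestions in the comments above can be sketched roughly as follows. This is only an illustration: the `lads` data frame here is a made-up stand-in for the asker's data, the names `subset_rows`, `random_subset` and `safe_subset` are hypothetical, and it assumes the filter matches at least one row.

```r
library(dplyr)

set.seed(123)  # any integer; makes the random draws reproducible

# Toy stand-in for the asker's `lads` data (values are made up)
lads <- data.frame(
  scenario = rep(c("north", "south"), times = c(12, 30)),
  urban_areas_simple = "Primary Urban Areas",
  place = paste0("place_", 1:42)
)

subset_rows <- filter(lads,
  scenario == "north" & urban_areas_simple == "Primary Urban Areas")

# Option 1: draw the size from 1..nrow, so it can never exceed
# the number of rows available (assumes at least one matching row)
n <- sample(seq_len(nrow(subset_rows)), size = 1)
random_subset <- sample_n(subset_rows, n)

# Option 2: keep the 1:100 draw and fall back to all matching rows
# if sample_n() raises an error because the size is too large
safe_subset <- tryCatch(
  sample_n(subset_rows, sample(1:100, size = 1)),
  error = function(e) subset_rows
)
```

Option 1 avoids the error altogether, whereas Option 2 only handles it after the fact, so Option 1 is probably the simpler choice when the row count is known up front.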
0
filter(lads, 
  scenario == "north" & urban_areas_simple == "Primary Urban Areas") %>% 
  sample_frac(runif(1)) 

does just that.

Because runif(1) returns a fraction between 0 and 1, the requested sample can never exceed the number of rows available, and sample_frac also handles stratified sampling from a grouped data frame with unequal group sizes.

Robin Gertenbach
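
As a rough sketch of the grouped case mentioned in this answer, the example below applies one random fraction to every group of a made-up data frame with unequal group sizes. The data and the names `frac` and `stratified` are assumptions for illustration only. Note that the per-group row count is rounded, so a small fraction may leave a small group with no rows at all.

```r
library(dplyr)

set.seed(123)  # any integer; makes the random draws reproducible

# Toy data with unequal group sizes (values are made up)
lads <- data.frame(
  scenario = rep(c("north", "south", "scotland"), times = c(12, 30, 8)),
  place = paste0("place_", 1:50)
)

# One random fraction between 0 and 1, shared by all groups
frac <- runif(1)

# sample_frac() on a grouped data frame samples that fraction
# of the rows within each group
stratified <- lads %>%
  group_by(scenario) %>%
  sample_frac(frac) %>%
  ungroup()

# Rows drawn per scenario
count(stratified, scenario)
```

If each group should get its own random fraction rather than a shared one, the draw would have to happen per group instead; that is a different design from the one shown in the answer.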