Randomly sample a percentage of rows within a data frame

Question

Related to this question.

gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age) 

mydf[ sample( which(mydf$gender=='F'), 3 ), ]

Instead of selecting a fixed number of rows (3 in the case above), how can I randomly select 20% of the rows with "F"? That is, of the five rows with "F", how do I randomly sample 20%?

score 22 · Answer 1 · answered Apr 07 '17 at 03:31

You can use the sample_frac() function in the dplyr package.

For example, to sample 20% of all rows, ignoring groups:

mydf %>% sample_frac(.2)

To sample 20% within each gender group:

mydf %>% group_by(gender) %>% sample_frac(.2)
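If dplyr is not available, the same per-group idea can be sketched in base R. This is only an illustration, not part of the original answer: it assumes `mydf` from the question, and the helper name `sample_frac_by` is made up here. (In dplyr 1.0.0 and later, sample_frac() is superseded by slice_sample(prop = ), though it still works.)

```r
# Data frame from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf   <- data.frame(gender, age)

# Hypothetical helper: sample a fraction of rows within each level of `group`
sample_frac_by <- function(df, group, frac) {
  # Split the row numbers by group, then sample within each piece
  pieces <- split(seq_len(nrow(df)), df[[group]])
  idx <- unlist(lapply(pieces, function(rows) {
    k <- round(frac * length(rows))        # rows to draw from this group
    rows[sample.int(length(rows), k)]      # pick k positions, map back to row numbers
  }), use.names = FALSE)
  df[idx, , drop = FALSE]
}

set.seed(42)
res <- sample_frac_by(mydf, "gender", 0.2)
```

With the question's data this draws one row from each gender, since 20% of five "F" rows is 1 and 20% of three "M" rows (0.6) rounds to 1.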

Zhen Liang

Ben · Accepted Answer · 2013-02-22T18:50:08.460

15

How about this:

mydf[ sample( which(mydf$gender=='F'), round(0.2*length(which(mydf$gender=='F')))), ]

Here 0.2 is your 20% and length(which(mydf$gender=='F')) is the total number of rows with "F"; round() turns the product into a whole number of rows.
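The one-liner can also be wrapped so the matching row numbers are computed only once. The following is a sketch under assumptions: the helper name `sample_rows_frac` is hypothetical, and it indexes through sample.int() because sample(x, k) treats a single number x as 1:x, which would give wrong rows if only one row matched.

```r
# Data frame from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf   <- data.frame(gender, age)

# Hypothetical helper: sample a fraction of the rows where `keep` is TRUE
sample_rows_frac <- function(df, keep, frac) {
  idx <- which(keep)                   # row numbers matching the condition
  k   <- round(frac * length(idx))     # e.g. 20% of 5 rows -> 1 row
  df[idx[sample.int(length(idx), k)], , drop = FALSE]
}

set.seed(1)
f_sample <- sample_rows_frac(mydf, mydf$gender == "F", 0.2)
```

Because the size is derived from the data on every call, nothing has to be adjusted by hand when the number of "F" rows changes.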

edited Feb 22 '13 at 18:50

answered Feb 22 '13 at 18:40

2

+1, but do mind that 20% can be something other than an integer, so using round would be needed. – Paul Hiemstra Feb 22 '13 at 18:44
1

good point, thanks, I've added that in. By the way, you're missing a comma and close square bracket in your answer – Ben Feb 22 '13 at 18:51

score 3 · Answer 3 · answered Feb 25 '13 at 07:46

Self-promotion alert. I wrote a function that allows convenient stratified sampling, and I've included an option to subset levels from the grouping variables before sampling.

The function is called stratified and can be used in the following ways:

set.seed(1)
# Proportional sample
stratified(mydf, group="gender", size=.2, select=list(gender = "F"))
#   gender age
# 4      F  29
# Fixed-size sampling
stratified(mydf, group="gender", size=2, select=list(gender = "F"))
#   gender age
# 4      F  29
# 5      F  31

You can specify multiple groups (for example if your data frame included a "state" variable and you wanted to group by "state" and "gender" you would specify group = c("state", "gender")). You can also specify multiple "select" arguments (for example, if you wanted only female respondents from California and Texas, and your "state" variable used two-letter state abbreviations, you could specify select = list(gender = "F", state = c("CA", "TX"))).

The function itself can be found here or you can download and install the package (which gives you convenient access to the help pages and examples) by using install_github from the "devtools" package as follows:

# install.packages("devtools")
library(devtools)
install_github("mrdwab/mrdwabmisc")  # current devtools expects "user/repo"

Hi, the URL link seems to be dead – agenis Apr 05 '19 at 14:32

Paul Hiemstra · Answer 4 · 2013-02-22T18:53:24.490

2

To sample 20%, you can use this to get the sample size:

n = round(0.2 * nrow(mydf[mydf$gender == "F",]))
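To go from that size to the actual rows, one possible follow-up is to pass n to sample(). This is a sketch reusing `mydf` from the question; the names `f_rows` and `picked` are introduced here for illustration only.

```r
# Data frame from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf   <- data.frame(gender, age)

# Sample size: 20% of the rows with "F"
n <- round(0.2 * nrow(mydf[mydf$gender == "F", ]))

set.seed(123)
f_rows <- which(mydf$gender == "F")    # row numbers of the "F" rows
picked <- mydf[sample(f_rows, n), ]    # draw n of them at random
```

Since n is recomputed from the data each time the script runs, it adapts automatically when the number of "F" rows changes.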

edited Feb 22 '13 at 18:53

answered Feb 22 '13 at 18:41

Yeah, I was able to do that but this is a file that's automated and run every hour so I can't really go in and adjust the values w/o writing another function w/ an if else statement. Figured there'd be a simpler approach – ATMathew Feb 22 '13 at 18:43
3

This is exactly the answer to your question, if your question is different, please edit in more details. – Paul Hiemstra Feb 22 '13 at 18:52
Anyone care to comment on the downvote? This answer exactly answers the question. – Paul Hiemstra Feb 22 '13 at 18:59
@PaulHiemstra, I am not sure this answers the question; it seems to return only the number of rows, not the actual data. Could you please advise? Ben's answer seems to return the expected result. – Mohsen Sichani Mar 28 '19 at 05:44
