R: How to subset a dataframe by condition AND random at the same time?

Question

I am struggling with a subsetting problem and hope that I can get some help from the crowd.

My dataset surveydata has a column landscape. I need to select all rows of landscape types 5 and 7 and, in addition, randomly select 50 rows from each of landscape types 3 and 6. Then I want to create a new variable called sub in the surveydata data frame which should contain a number, e.g. 1 if the row was selected in the previous step and 0 (or NA) if it wasn't.

I would prefer a base R solution, but I am open to other approaches.

I provide a random dataset for better understanding.

# create data: six columns of 1000 random 0/1 values
surveydata <- as.data.frame(replicate(6, sample(0:1, 1000, replace = TRUE)))
# change values of columns
surveydata$V3 <- sample(7, size = nrow(surveydata), replace = TRUE)
surveydata$V4 <- sample(5, size = nrow(surveydata), replace = TRUE)
surveydata$V5 <- sample(5, size = nrow(surveydata), replace = TRUE)
surveydata$V6 <- sample(5, size = nrow(surveydata), replace = TRUE)
# create group column alternating 1 and 2 (c(1, 2) is recycled over all rows)
surveydata$group <- c(1,2)
# rename columns
colnames(surveydata)[1] <- "gender"
colnames(surveydata)[2] <- "expert"
colnames(surveydata)[3] <- "landscape"
colnames(surveydata)[4] <- "q1"
colnames(surveydata)[5] <- "q2"
colnames(surveydata)[6] <- "q3"

Does this answer your question? [Stratified random sampling from data frame](https://stackoverflow.com/questions/23479512/stratified-random-sampling-from-data-frame) — Peter O., Jul 08 '20 at 12:47

Allan Cameron · Accepted Answer · 2020-07-08T13:30:46.850

Here's an R method which uses sampling and indexing to achieve the results:

# Sample 50 row indices (without replacement) from the rows where landscape is 3 or 6, both types combined
ss <- sample(with(surveydata, which(landscape == 6 | landscape == 3)), 50, FALSE)

# Append index of all rows where landscape is 5 or 7
ss <- c(ss, with(surveydata, which(landscape == 5 | landscape == 7)))

# Create subset data frame
subset <- surveydata[ss, ]

# Create sub column to show which rows have been sampled
surveydata$sub <- numeric(nrow(surveydata))
surveydata$sub[ss] <- 1

# test result of creating sub column
head(surveydata)
#>   gender expert landscape q1 q2 q3 group sub
#> 1      0      1         7  1  5  3     1   1
#> 2      1      1         5  2  2  3     2   1
#> 3      0      0         4  5  5  2     1   0
#> 4      0      0         3  5  5  4     2   0
#> 5      0      1         7  1  5  1     1   1
#> 6      1      0         7  5  1  1     2   1

# ensure subsetted data frame is as expected
head(subset)
#>     gender expert landscape q1 q2 q3 group
#> 348      0      0         6  5  4  2     2
#> 333      1      1         6  4  2  4     1
#> 521      1      0         6  1  5  5     1
#> 522      1      0         6  4  5  2     2
#> 563      0      1         6  2  4  2     1
#> 13       0      0         6  5  2  4     1

Created on 2020-07-08 by the reprex package (v0.3.0)
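
The question asks for 50 rows from each of landscape types 3 and 6, whereas the code above draws 50 rows from the two types combined. A minimal base R sketch of the per-type variant might look like the following. The seed, the reduced stand-in data frame, and the names n_per_type, sampled_idx and full_idx are assumptions made for illustration and are not part of the original answer.

```r
# Reproducible stand-in for the question's data (seed and columns are assumptions)
set.seed(42)
surveydata <- data.frame(
  landscape = sample(7, size = 1000, replace = TRUE),
  q1        = sample(5, size = 1000, replace = TRUE)
)

n_per_type <- 50  # hypothetical name: rows to draw from each sampled type

# Draw row indices separately for each landscape type that should be sampled
sampled_idx <- unlist(lapply(c(3, 6), function(type) {
  candidates <- which(surveydata$landscape == type)
  # Sample positions within candidates, so a single candidate is never
  # misread by sample() as the range 1:n
  n_draw <- min(n_per_type, length(candidates))
  candidates[sample.int(length(candidates), n_draw)]
}))

# Keep every row of the landscape types that should be taken in full
full_idx <- which(surveydata$landscape %in% c(5, 7))

# Flag the selected rows: 1 if selected, 0 otherwise
surveydata$sub <- 0
surveydata$sub[c(sampled_idx, full_idx)] <- 1

# Cross-tabulate to check the number of selected rows per landscape type
table(landscape = surveydata$landscape, sub = surveydata$sub)
```

Sampling positions with sample.int() rather than passing the candidate indices to sample() directly avoids the case where a type has exactly one row, which sample() would treat as a request to sample from 1:n. The min() call caps the draw if a type has fewer than 50 rows.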

Thanks @AllanCameron! Unfortunately, the first line of your code does not select the 50 cases intended but selects all cases of landscapes 6 and 3. Did you observe that, too? — Boombardeiro, Jul 08 '20 at 13:23
@Boombardeiro sorry - that was caused by a typo on my first line. Please see my update. — Allan Cameron, Jul 08 '20 at 13:31
