This question builds on an earlier SO post and uses code adapted from a post on the R-help mailing list.

I am trying to extract a random sample of rows from a data frame, subject to a condition. I am using the R iris data, which looks like:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa 

To take a simple random sample, the code below works fine to take a sample of 2 rows.

iris[sample(nrow(iris), 2), ]

However, I am unsure how to condition on the Species field. For example, how do I take the random sample shown above, but only from rows where Species != "setosa"?

There are three categories of iris$Species:

> summary(iris$Species)
    setosa versicolor  virginica 
        50         50         50

I am unsure how to nest the conditions correctly. One of my earlier attempts is below, with its obviously incorrect result included:

> iris[sample(nrow(iris)[iris$Species != "setosa"], 2), ]
     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
NA             NA          NA           NA          NA    <NA>
NA.1           NA          NA           NA          NA    <NA>

Thanks

B. Davis

4 Answers

I'd use which() to get the vector of row numbers that satisfy your condition, and then sample from that vector:

iris[ sample( which( iris$Species != "setosa" ) , 2 ) , ]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#59           6.6         2.9          4.6         1.3 versicolor
#133          6.4         2.8          5.6         2.2  virginica
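
As a rough sketch of why the attempt in the question returns NA rows: nrow(iris) is a single number, so indexing it with a 150-element logical vector reaches past its end and gives NA, whereas which() returns the matching row positions. The names bad, keep and picked are only illustrative, and the seed is there just to make the draw repeatable.

```r
# nrow(iris) is a length-one integer (150). The logical index is TRUE only at
# positions 51 to 150, all of which lie past the end of that vector, so every
# selected element is NA.
bad <- nrow(iris)[iris$Species != "setosa"]
length(bad)       # 100
all(is.na(bad))   # TRUE

# which() returns the row positions where the condition holds
keep <- which(iris$Species != "setosa")

set.seed(1)  # only to make the draw repeatable
picked <- iris[sample(keep, 2), ]
picked$Species
```

One caveat: if only a single row matches, sample(keep, 1) treats keep as an upper bound and samples from 1:keep. Indexing with keep[sample.int(length(keep), 1)] avoids that case.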
Simon O'Hanlon
  • Slick. Thanks, I too often forget about which(). – B. Davis Nov 15 '13 at 02:27
  • @simon-ohanlon How can I apply this condition: take random sample with replacement but probabilities should be 80% versicolor and 20% virginica – Newbie Sep 14 '16 at 15:58
  • How can I make a training set with, say, n random rows that satisfy a certain condition? For example, 10 random rows that satisfy iris$Species == "setosa", 10 random rows that satisfy iris$Species == "versicolor" and 10 random rows that satisfy iris$Species == "virginica"? – Mehdi Abbassi Mar 24 '20 at 18:03
  • @MehdiAbbassi Here: https://stackoverflow.com/a/23193184/1478381 `iris %>% group_by(Species) %>% do(sample_n(.,10))` – Simon O'Hanlon Mar 24 '20 at 18:33
  • Hi @SimonO'Hanlon. This was helpful. Can you show how to add two/multiple conditions / conditions from two or more columns instead of one? – rais Sep 10 '22 at 06:30
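
The two open questions in the comments above (several conditions, and a weighted draw with replacement) could be handled along these lines. This is only a sketch: the Sepal.Length > 6 threshold, the sample size of 10 and the names keep, target and w are assumptions chosen for illustration.

```r
# Two conditions on different columns, combined with & (element-wise AND)
keep <- which(iris$Species != "setosa" & iris$Sepal.Length > 6)
set.seed(1)  # only to make the draws repeatable
two_cond <- iris[sample(keep, 2), ]

# Weighted draw with replacement: about 80% versicolor and 20% virginica
idx <- which(iris$Species != "setosa")
sp <- as.character(iris$Species[idx])
target <- c(versicolor = 0.8, virginica = 0.2)
# Spread each species' share evenly over its rows, so the row weights sum to 1
w <- target[sp] / as.vector(table(sp)[sp])
weighted <- iris[sample(idx, 10, replace = TRUE, prob = w), ]
table(droplevels(weighted$Species))
```

Dividing by the group size matters only when the groups differ in size; in iris both species have 50 rows, so the result is the same as weighting rows by 0.8 and 0.2 directly.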

With dplyr:

library(dplyr)
set.seed(12)
filter(iris, Species != "setosa") %>% sample_n(., 2) 

Output:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
7           6.3         3.3          4.7         1.6 versicolor
81          7.4         2.8          6.1         1.9  virginica
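
In dplyr 1.0.0 and later, sample_n() is marked as superseded in favour of slice_sample(). Assuming such a version is installed, an equivalent sketch would be the following; the rows drawn may differ from the output shown above even with the same seed.

```r
library(dplyr)

set.seed(12)  # only to make the draw repeatable
sampled <- iris %>%
  filter(Species != "setosa") %>%  # keep the rows that meet the condition
  slice_sample(n = 2)              # then draw 2 of them at random
sampled
```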
mpalanco

It would be cleaner not to do this in one line, but:

iris[iris$Species != "setosa",][sample(nrow(iris[iris$Species != "setosa",]), 2), ]
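
A two-step version of the same idea might look like this, with the filtered data frame stored first so the condition is written only once. The name not_setosa is just an illustrative choice.

```r
# Step 1: keep only the rows that meet the condition
not_setosa <- iris[iris$Species != "setosa", ]

# Step 2: sample row positions within the filtered data frame
set.seed(1)  # only to make the draw repeatable
two_rows <- not_setosa[sample(nrow(not_setosa), 2), ]
two_rows
```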
colcarroll

Clean and simple data.table approach:

library(data.table)
iris <- data.table(iris)
# which = TRUE returns the matching row numbers instead of the rows themselves
cond <- iris[Species != "setosa", which = TRUE]
iris[sample(cond, 2)]
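
The same result can probably be reached by chaining, using .N (the row count of the filtered table) inside the second pair of brackets. In this sketch iris_dt is an assumed name, used so that the built-in iris data frame is not overwritten.

```r
library(data.table)

iris_dt <- as.data.table(iris)  # copy, leaving the built-in iris untouched

set.seed(1)  # only to make the draw repeatable
# Filter first, then sample row positions 1..(.N) of the filtered table
picked_dt <- iris_dt[Species != "setosa"][sample(.N, 2)]
picked_dt
```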
Andreas