7

So I've got a dataset that looks something like this:

a   b   c
23  34  Falcons
14  9   Hawks
2   18  Eagles
3   21  Eagles
22  8   Falcons
11  4   Hawks

And I know I can do a random subset of rows with a nested conditional, but what I want to do is create a random subset that takes at least one of each available value in column 'c'.

So possible correct subsets would be

23  34  Falcons
14  9   Hawks
3   21  Eagles

or

11  4   Hawks
2   18  Eagles
22  8   Falcons

[in no particular order], but something like:

14  9   Hawks
2   18  Eagles
3   21  Eagles

would NOT work, because 'Falcons' is not represented. Is there an easy way to do this in R?

Community
  • 1
  • 1
skathan
  • 641
  • 7
  • 16
  • do you just want one from each group? – rawr Mar 08 '16 at 01:38
  • So the code could return a row count somewhere between n=1 and n=all for each `df$c` ? – thelatemail Mar 08 '16 at 01:39
  • The code should return a subset of whatever number you want, so long as it's greater than `nrow(df%>% group_by(c) %>% tally())`, so, in this case, it should be any number you chose between 3 and 6. So to answer your question, rawr, no, I don't want just one from each group. I'll make that clearer in the question. – skathan Mar 08 '16 at 01:46
  • I'm not a `dplyr`-er normally, but a `sample` of rows determined by `sample` should do it - `df[unlist(by(rownames(df), df$c, function(x) sample(x, sample(seq_along(x),1) ) )),]` for example. – thelatemail Mar 08 '16 at 01:49
  • you can use this to set the n you want from each group methinks `n <- setNames(c(1,2,1), unique(dd$c)); dd[!!as.numeric(ave(dd$c, dd$c, FUN = function(x) sample(rep(0:1, c(length(x) - n[x[1]], n[x[1]]))))), ]` – rawr Mar 08 '16 at 01:53
  • Rawr and thelatemail, both good answers, but what if I wanted to just set the number of rows in the subset? – skathan Mar 08 '16 at 01:58
  • @skathan set n = 1 for each group? – rawr Mar 08 '16 at 02:00
  • @skathan - use `min` to select however many you want or the number of rows - e.g.: `df[unlist(by(rownames(df), df$c, function(x) sample(x, min(length(x),3) ) )),]` – thelatemail Mar 08 '16 at 02:02

2 Answers2

3

Use group_by and sample_n functions in the dplyr package.

text1 <- "a   b   c
23  34  Falcons
14  9   Hawks
2   18  Eagles
3   21  Eagles
22  8   Falcons
11  4   Hawks"

dat <- read.table(text=text1, head=T, as.is=T)

library(dplyr)
dat %>% group_by(c) %>% sample_n(1)

# Source: local data frame [3 x 3]
# Groups: c [3]

#       a     b       c
#   (int) (int)   (chr)
# 1     3    21  Eagles
# 2    22     8 Falcons
# 3    11     4   Hawks

UPDATE: You can write a function to do the sampling.

sample_df <- function(df) {
  df.r <- sample(1:nrow(df), 1)
  return(sample_n(df, df.r))
}
dat %>% group_by(c) %>% do(sample_df(.))
Ven Yao
  • 3,680
  • 2
  • 27
  • 42
  • This takes care of half of the issue -- what I'm looking for is a subset of variable length, not just one that groups by c. This solution gives a random subset equal to the number of unique entities in column c, but let's say for example I wanted 4 or 5 values in that subset -- how would I go about that? – skathan Mar 08 '16 at 01:48
  • Almost perfect, but how would I manually set the size of the sample? Let's say I explicitly want 4 entries, for example. – skathan Mar 08 '16 at 02:05
2

You can specify the n for each group here (use 1s if you only want a data frame with nrows == number of groups

dd <- read.table(header = TRUE, text = 'a   b   c
23  34  Falcons
14  9   Hawks
2   18  Eagles
3   21  Eagles
22  8   Falcons
11  4   Hawks', stringsAsFactors = FALSE)

(n <- setNames(c(1,2,1), unique(dd$c)))
# Falcons   Hawks  Eagles 
#       1       2       1 

set.seed(1)
dd[as.logical(ave(dd$c, dd$c, FUN = function(x)
  sample(rep(c(FALSE, TRUE), c(length(x) - n[x[1]], n[x[1]]))))), ]

#    a  b       c
# 1 23 34 Falcons
# 2 14  9   Hawks
# 4  3 21  Eagles
# 6 11  4   Hawks

Putting this into a function to automate some other things for you

sample_each <- function(data, var, n = 1L) {
  lvl <- table(data[, var])
  n1 <- setNames(rep_len(n, length(lvl)), names(lvl))
  n0 <- lvl - n1
  idx <- ave(as.character(data[, var]), data[, var], FUN = function(x)
    sample(rep(0:1, c(n0[x[1]], n1[x[1]]))))
  data[!!(as.numeric(idx)), ]
}

sample_each(dd, 'c', n = c(1,2,1))
#    a  b       c
# 1 23 34 Falcons
# 3  2 18  Eagles
# 5 22  8 Falcons
# 6 11  4   Hawks

sample_each(mtcars, 'gear', 1)
#                mpg cyl  disp  hp drat   wt  qsec vs am gear carb
# Valiant       18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
# Merc 280      19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4
# Maserati Bora 15.0   8 301.0 335 3.54 3.57 14.60  0  1    5    8


sample_each(mtcars, 'gear', c(2,2,5))
#                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Porsche 914-2      26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Lotus Europa       30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# Ford Pantera L     15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
# Ferrari Dino       19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
# Maserati Bora      15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
# Mazda RX4 Wag1     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# Hornet Sportabout1 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Merc 2801          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
rawr
  • 20,481
  • 4
  • 44
  • 78