
Say I have a dataframe as follows:

df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo = c(1,2,3,1,2,1,1,2,3,4))
> df
   Region Combo
1       A     1
2       A     2
3       A     3
4       B     1
5       B     2
6       C     1
7       D     1
8       D     2
9       D     3
10      D     4

What I would like to do is, for each Region (A, B, C, D), randomly choose only one of the possible combos for that region.

If the chosen combo were indicated by a binary variable, the result might look like this:

   Region Combo RandomlyChosen
1       A     1              1
2       A     2              0
3       A     3              0
4       B     1              0
5       B     2              1
6       C     1              1
7       D     1              0
8       D     2              0
9       D     3              1
10      D     4              0

I'm aware of the sample function, but I don't know how to choose only one combo within each region.

I regularly use data.table, so solutions using it are welcome, though solutions without data.table are equally welcome.

Thanks!

Bucket
  • `df[, Rand := .I==sample(as.character(.I),1), by=Region]` maybe for a data.table solution. Though I suspect there may be something much better, – thelatemail Mar 14 '16 at 22:30
  • Using base R: `sapply(split(df$Combo, df$Region), function(z) z[sample(1:length(z), 1)])` – Raad Mar 14 '16 at 22:31
  • @NBATrends That returns named numbers, not a data.frame. For that you'd need something like `do.call(rbind, lapply(split(df, df$Region), function(x){x[sample(seq_along(x$Combo), 1),]}))` – alistaire Mar 14 '16 at 22:42
  • @alistaire am aware was just suggesting that base R be considered. I definitely like the answer below using tapply. – Raad Mar 14 '16 at 22:52
  • Another base version: `df$RandomlyChosen <- unlist(lapply(split(df, df$Region), function(x){sample(c(1, rep(0, nrow(x)-1)))}))` – alistaire Mar 14 '16 at 23:00
  • @alistaire I like your last solution... However, I actually need to split my df over two columns... Then the lapply no longer works since x<-split[i] works over one column, but now I need x <-split[[i]].. if that makes sense? How do I make it reference the list with the double brackets? (I'm actually curious in general how to do this too) – Bucket Mar 14 '16 at 23:55
  • You can pass `split` a list of variables to split on `split(df, list(df$Region, df$Combo))`, but `dplyr` is probably easier: `df %>% group_by(Region, Combo) %>% mutate(Chosen = sample(c(1L, rep(0L, n()-1L))))` – alistaire Mar 15 '16 at 03:11
  • As for list indexing, if you save the result of `split` to `x`: `x <- split(df, df$Region)`, then you can subset `x` in any number of ways. The elements of `x` are named after the levels of `Region`, so `x$A` will return the data.frame with rows where `Region` is `A`. `x['A']` is actually a little different, because that returns a list of that same data.frame. To get the same as `x$A`, you need `x[['A']]`. Since that is a data.frame, you can subset it normally with another set of brackets: `x[['A']][2, 'Combo']` – alistaire Mar 15 '16 at 03:17
  • ...or maybe you're asking for `mapply`/`Map`, which are multivariate versions of `sapply` and `lapply`? I may have misunderstood. – alistaire Mar 15 '16 at 03:19
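The list-indexing points raised in the comments above can be checked directly. The following is a minimal base R sketch, assuming the `df` from the question; the `drop = TRUE` argument in the two-column split is an addition here (not from the comments) that discards Region/Combo pairs with no rows.

```r
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo  = c(1,2,3,1,2,1,1,2,3,4))

# Split into a named list of data.frames, one per Region.
x <- split(df, df$Region)
names(x)

# Single brackets keep the list wrapper; double brackets extract the element.
class(x["A"])      # "list"
class(x[["A"]])    # "data.frame"
identical(x$A, x[["A"]])   # TRUE

# The extracted data.frame can be subset as usual.
x[["A"]][2, "Combo"]       # 2

# Splitting on two columns: pass a list of grouping variables.
# drop = TRUE (an assumption added here) removes empty combinations.
y <- split(df, list(df$Region, df$Combo), drop = TRUE)
length(y)                  # 10
```

Inside `lapply(x, function(g) ...)`, each `g` is already the extracted data.frame (the `[[` form), so no extra indexing is needed within the function.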

1 Answer


In plain R you can use sample() within tapply():

df$Chosen <- 0
# Sample one negated row index per Region, then negate back to get the chosen rows
df[-tapply(-seq_along(df$Region), df$Region, sample, size = 1), ]$Chosen <- 1
df
   Region Combo Chosen
1       A     1      0
2       A     2      1
3       A     3      0
4       B     1      1
5       B     2      0
6       C     1      1
7       D     1      0
8       D     2      0
9       D     3      1
10      D     4      0

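Note the -(-row_number) trick: the row numbers are negated before sampling and negated back afterwards. When sample() is given a single number x >= 1, it samples from 1:x instead of returning x, which would pick the wrong row for a region with only one row (such as C). A single negative number is returned unchanged.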
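The `sample()` behaviour behind that trick can be seen in isolation. The sketch below is illustrative only, not part of the original answer: it shows the single-number pitfall, then one possible alternative that samples a position within each group using `sample.int()` and `ave()`, which avoids the negation. The helper name `pick_one` is hypothetical.

```r
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo  = c(1,2,3,1,2,1,1,2,3,4))

# Pitfall: a single number >= 1 is treated as the range 1:x.
set.seed(1)
draws <- replicate(200, sample(6, size = 1))
range(draws)           # values come from 1:6, not just 6

# A single negative number is returned as-is, hence the negation trick.
sample(-6, size = 1)   # always -6

# Alternative (hypothetical helper): flag one random position per group.
pick_one <- function(i) as.integer(seq_along(i) == sample.int(length(i), 1))
df$Chosen <- ave(seq_len(nrow(df)), df$Region, FUN = pick_one)

# Exactly one row is flagged in every region.
tapply(df$Chosen, df$Region, sum)
```

Because `sample.int(n, 1)` always draws from `1:n`, a one-row region simply gets position 1, so no special case is needed.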

HubertL