How to randomly choose only one row in each group

Question

Say I have a dataframe as follows:

df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo = c(1,2,3,1,2,1,1,2,3,4))
> df
   Region Combo
1       A     1
2       A     2
3       A     3
4       B     1
5       B     2
6       C     1
7       D     1
8       D     2
9       D     3
10      D     4

For each Region (A, B, C, D), I would like to randomly choose exactly one of the possible combos for that region.

If the chosen combination were indicated by a binary variable, the result might look like this:

   Region Combo RandomlyChosen
1       A     1              1
2       A     2              0
3       A     3              0
4       B     1              0
5       B     2              1
6       C     1              1
7       D     1              0
8       D     2              0
9       D     3              1
10      D     4              0

I'm aware of the sample function, but just don't know how to choose only one combo within each region.

I regularly use data.table, so any solutions using that are welcome, though solutions not using data.table are equally welcome.

Thanks!

`df[, Rand := .I==sample(as.character(.I),1), by=Region]` maybe for a data.table solution, though I suspect there may be something much better. — thelatemail, Mar 14 '16 at 22:30
Using base R: `sapply(split(df$Combo, df$Region), function(z) z[sample(1:length(z), 1)])` — Raad, Mar 14 '16 at 22:31
@NBATrends That returns named numbers, not a data.frame. For that you'd need something like `do.call(rbind, lapply(split(df, df$Region), function(x){x[sample(seq_along(x$Combo), 1),]}))` — alistaire, Mar 14 '16 at 22:42
@alistaire I'm aware; I was just suggesting that base R be considered. I definitely like the answer below using tapply. — Raad, Mar 14 '16 at 22:52
Another base version: `df$RandomlyChosen <- unlist(lapply(split(df, df$Region), function(x){sample(c(1, rep(0, nrow(x)-1)))}))` — alistaire, Mar 14 '16 at 23:00
@alistaire I like your last solution... However, I actually need to split my df over two columns... Then the lapply no longer works since x<-split[i] works over one column, but now I need x <-split[[i]].. if that makes sense? How do I make it reference the list with the double brackets? (I'm actually curious in general how to do this too) — Bucket, Mar 14 '16 at 23:55
You can pass `split` a list of variables to split on `split(df, list(df$Region, df$Combo))`, but `dplyr` is probably easier: `df %>% group_by(Region, Combo) %>% mutate(Chosen = sample(c(1L, rep(0L, n()-1L))))` — alistaire, Mar 15 '16 at 03:11
As for list indexing, if you save the result of `split` to `x`: `x <- split(df, df$Region)`, then you can subset `x` in any number of ways. The elements of `x` are named after the levels of `Region`, so `x$A` will return the data.frame with rows where `Region` is `A`. `x['A']` is actually a little different, because that returns a list of that same data.frame. To get the same as `x$A`, you need `x[['A']]`. Since that is a data.frame, you can subset it normally with another set of brackets: `x[['A']][2, 'Combo']` — alistaire, Mar 15 '16 at 03:17
...or maybe you're asking for `mapply`/`Map`, which are multivariate versions of `sapply` and `lapply`? I may have misunderstood. — alistaire, Mar 15 '16 at 03:19
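The two-column split and the `[` versus `[[` distinction discussed in the comments above can be sketched in base R as follows. This is only an illustration: the `Year` column is a hypothetical second grouping variable that is not part of the original question, and `pieces` and `flags` are names made up for the example.

```r
# Example data from the question
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo  = c(1,2,3,1,2,1,1,2,3,4))

# Hypothetical second grouping column (an assumption, not in the question)
df$Year <- rep(c(2015, 2016), times = 5)

# Split on both columns; drop = TRUE discards Region/Year
# combinations that have no rows (here C.2015)
pieces <- split(df, list(df$Region, df$Year), drop = TRUE)

# `[` keeps the list wrapper, `[[` extracts the data.frame itself
class(pieces["A.2015"])    # "list"
class(pieces[["A.2015"]])  # "data.frame"

# lapply() passes each element to the function as if extracted with `[[`,
# so x is a data.frame here
flags <- lapply(pieces, function(x) {
  as.integer(seq_len(nrow(x)) == sample.int(nrow(x), 1))
})

# unsplit() puts the per-group flags back in the original row order
df$Chosen <- unsplit(flags, list(df$Region, df$Year), drop = TRUE)
```

Because the flags are reassembled with `unsplit()`, this sketch should not depend on the rows being sorted by group.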

HubertL · Accepted Answer · 2016-03-14T22:48:34.437

In plain R you can use sample() within tapply():

df$Chosen <- 0
df[-tapply(-seq_along(df$Region),df$Region, sample, size=1),]$Chosen <- 1
df
   Region Combo Chosen
1       A     1      0
2       A     2      1
3       A     3      0
4       B     1      1
5       B     2      0
6       C     1      1
7       D     1      0
8       D     2      0
9       D     3      1
10      D     4      0

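Note the double negation, -(-row_number): sample() treats a single number n >= 1 as shorthand for 1:n, so a group with only one row (row n) would otherwise be sampled from 1:n instead of returning n. Negating the row numbers before sampling avoids this, and the leading minus restores the positive row numbers.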
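As a rough alternative sketch that sidesteps the single-number behaviour of sample() altogether, one could sample a position within each group rather than a row number. The helper name `pick_one_flag` is hypothetical and introduced only for this illustration; it is not part of the answer above.

```r
# Example data from the question
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo  = c(1,2,3,1,2,1,1,2,3,4))

# sample() returns a single negative number unchanged, because the
# "sample from 1:n" shortcut only applies to a single number >= 1
stopifnot(sample(-6, 1) == -6)

# Hypothetical helper: returns a 0/1 vector the same length as idx
# with exactly one 1 at a randomly chosen position
pick_one_flag <- function(idx) {
  as.integer(seq_along(idx) == sample.int(length(idx), 1))
}

# ave() applies the helper within each Region and returns the
# results in the original row order
df$Chosen <- ave(seq_len(nrow(df)), df$Region, FUN = pick_one_flag)
```

Since sample.int(n, 1) always draws from 1:n, a group with a single row simply gets position 1, so no negation is needed here.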
