1

I'm trying to subset data according to conditions in different rows. In other words, I'd like to subset customers who visit store A and store B.

Here is an example:

dfrm <- data.frame(cust_id=sample(c(1:3),5,replace = TRUE), 
                   store=sample(LETTERS[1:2],5,replace = TRUE))

> dfrm
     cust_id store
>>       3     B
>>       3     A
>>       1     B
>>       1     B
>>       2     B

and I'd like to get a subset of cust_id = 3, because he visit A and B.

data

dfrm <-
structure(list(cust_id = c(3, 3, 1, 1, 2), store = structure(c(2L, 
1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("cust_id", "store"),
row.names = c(NA, -5L), class = "data.frame")
staove7
  • 560
  • 5
  • 18
  • 1
    Possible duplicate of [Select rows from a data frame based on values in a vector](http://stackoverflow.com/questions/11612235/select-rows-from-a-data-frame-based-on-values-in-a-vector) – Sraffa May 09 '17 at 18:46
  • Sraffa, I've looked over the question you linked to, and I don't think it answers my problem (maybe I missed something). @Lamia already solved my problem, but TKS anyway. – staove7 May 09 '17 at 19:05
  • 2
    I added a reproducible version of your data set. Your's is not reproducible because you did not include `set.seed`. – lmo May 09 '17 at 19:06
  • Suppose `cust_id` has two `B`-values and one `A`-value. Which values do you then want to keep? – Jaap May 09 '17 at 19:07
  • 1
    A base R solution to keep all observations of customers who visit all stores is `dfrm[as.logical(ave(as.character(dfrm$store), dfrm$cust_id, FUN=function(x) all(unique(as.character(dfrm$store)) %in% x))), ]`. – lmo May 09 '17 at 19:08

2 Answers2

2

For the sake of completeness, here are also two data.table solutions:

The first one returns all rows of dfrm of all customers who visited exactly two different stores:

library(data.table)
setDT(dfrm)[, .SD[uniqueN(store) == 2L], by = cust_id]
#   cust_id store
#1:       3     B
#2:       3     A

This is what also the dplyr solution does but about 25% faster for a data set of 80 k rows.

The second one returns all rows of dfrm of all customers who visited the maximum number of different stores (which is also 2 in the given toy data sample):

setDT(dfrm)[, .SD[uniqueN(store) == uniqueN(dfrm[["store"]])], by = cust_id]
#   cust_id store
#1:       3     B
#2:       3     A

This is what also the base R solution does but about 15 times faster for a data set of 80 k rows.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
1

Using the dplyr package, you could do:

dfrm %>% group_by(cust_id) %>% filter(n_distinct(store,na.rm=T)==2)

Which returns the customers that visit two different stores:

cust_id  store
    <dbl> <fctr>
1       3      B
2       3      A
Lamia
  • 3,845
  • 1
  • 12
  • 19