1
Model<-c("A","A","A","A","A","B","B","B","B","B","C","C","C","C")
Price<-c(12,14,15,13,16,36,32,24,14,15,14,11,24,31)
region<-c("W","E","E","W","W","E","E","E","E","W","W","W","E","W")
dt<-data.frame(Model,Price,region)

 Model Price region
1      A    12      W
2      A    14      E
3      A    15      E
4      A    13      W
5      A    16      W
6      B    36      E
7      B    32      E
8      B    24      E
9      B    14      E
10     B    15      W
11     C    14      W
12     C    11      W
13     C    24      E
14     C    31      W
> 

What I want to do is delete the rows if only one W or E happened in that Model type. We keep all rows for model A. We delete the 10th row because only 1 W in model B. And we also delete 13th row because only 1 E in model C.

How can do this in R? I have about 20,000 observations with thousands of model type. I may need to write a loop.

Yanying Li
  • 11
  • 6
  • 2
    are you just looking at the two columns or is your data more complicated? `dt[duplicated(dt[, -2]) | duplicated(dt[, -2], fromLast = TRUE), ]` – rawr Apr 05 '16 at 22:48
  • This works. Thank you! But can we also take away the model type if all those are in the same region. Let's change the 13th row to (C,24,W). Right now, I wanna delete whole type C because all C are in W region. – Yanying Li Apr 05 '16 at 23:41

2 Answers2

4
Model<-c("A","A","A","A","A","B","B","B","B","B","C","C","C","C")
Price<-c(12,14,15,13,16,36,32,24,14,15,14,11,24,31)
region<-c("W","E","E","W","W","E","E","E","E","W","W","W","E","W")
dt<-data.frame(Model,Price,region)

these will be removed

dt[!(duplicated(dt[, -2]) | duplicated(dt[, -2], fromLast = TRUE)), ]

#    Model Price region
# 10     B    15      W
# 13     C    24      E

these will be kept

dt[duplicated(dt[, -2]) | duplicated(dt[, -2], fromLast = TRUE), ]

#    Model Price region
# 1      A    12      W
# 2      A    14      E
# 3      A    15      E
# 4      A    13      W
# 5      A    16      W
# 6      B    36      E
# 7      B    32      E
# 8      B    24      E
# 9      B    14      E
# 11     C    14      W
# 12     C    11      W
# 14     C    31      W

For 20k observations, almost 5000 model types

set.seed(1)
n <- 20000
dd <- data.frame(Model = sample(1:5000, n, TRUE),
                 Price = rpois(n, 15),
                 region = sample(c('E','W'), n, TRUE))

dim(dd[duplicated(dd[, -2]) | duplicated(dd[, -2], fromLast = TRUE), ])
# [1] 17289     3

If you want more control over the number, you can use something like the following which is nearly as quick although I only tried up to 200k obs and 10k models. Change the 1 to some other number

dim(dd[ave(as.numeric(dd$region), dd[, -2], FUN = length) > 1, ])
# [1] 17289     3

dt[ave(as.numeric(dt$region), dt[, -2], FUN = length) > 1, ]

#    Model Price region
# 1      A    12      W
# 2      A    14      E
# 3      A    15      E
# 4      A    13      W
# 5      A    16      W
# 6      B    36      E
# 7      B    32      E
# 8      B    24      E
# 9      B    14      E
# 11     C    14      W
# 12     C    11      W
# 14     C    31      W
rawr
  • 20,481
  • 4
  • 44
  • 78
  • This works. Thank you! But can we also take away the model type if all those are in the same region. Let's change the 13th row to (C,24,W). Right now, I wanna delete whole type C because all C are in W region. – Yanying Li Apr 05 '16 at 23:43
  • No reason to use `as.numeric` it in your `ave`, is there? – Frank Apr 06 '16 at 00:02
  • @YanyingLi `dt[with(dt, ave(as.numeric(Model), Model, FUN = length) != ave(as.numeric(Model), dt[, -2], FUN = length)), ]` is this independent of the current question or do you mean simultaneously? – rawr Apr 06 '16 at 00:47
  • @Frank if region is a factor in OP's original data set, you will get invalid level warnings. but you're right about the second example – rawr Apr 06 '16 at 00:48
  • @rawr This is an independent question. Lets go back to the original data I posted. How about I want to get rid of the model type if all of them drop in a single region? – Yanying Li Apr 06 '16 at 15:26
  • @rawr (More explanation for last comment) If I use duplicated(dt[,-2]), I would like to delete the model type which shows only the first FALSE, all other following are TRUE. – Yanying Li Apr 06 '16 at 15:56
  • @YanyingLi so just the reverse of what `duplicated(dt[,-2])` gives you? so `!duplicated(dt[,-2])`, does that work? – rawr Apr 06 '16 at 17:55
  • @rawr not really.I expect the whole model type be dropped if duplicated() shows the first false and all other following are true. – Yanying Li Apr 06 '16 at 21:29
  • @YanyingLi the code in the above comment will do that as will this which is a little shorter `dt[!ave(duplicated(dt[, -2]) == duplicated(dt$Model), dt$Model, FUN = all), ]` – rawr Apr 06 '16 at 21:38
1

You can create a counter variable and filter by that. Using dplyr package:

library(dplyr) dt <- dt %>% group_by(Model) %>% filter(n_distinct(region) > 1) %>% group_by(Model, region) %>% filter(n() > 1)

Psidom
  • 209,562
  • 33
  • 339
  • 356