2

I need some help in R. I want to find which brand appears in every ID. My data contains three tab delimited fields, lets call them ID (which identify a person), time, brand:

ID time brand
01 1111 BMW
01 1112 BMW
01 1113 Audi
01 1114 Mercedes
02 1115 Audi
02 1115 Cadillac
02 1116 BMW
03 1117 Acura
03 1117 Audi
03 1117 BMW
03 1118 Mercedes
.......

According to the data, therefore, the brands that is common to all IDs are BMW and Audi. Time field is not always unique. However since I have many more records, I can't be sure if there is a common brand to all the IDs. In which case I want to get a rank of what brand appears in most ID.

I have many unique IDs and brands, so I can't use the explanations here How to find common elements from multiple vectors?

I would like to know how to do it in R. Thank you.

Community
  • 1
  • 1
Rick Blaine
  • 175
  • 1
  • 6

2 Answers2

1
data = read.table(text = "ID time brand
        01 1111 BMW
        01 1112 BMW
        01 1113 Audi
        01 1114 Mercedes
        02 1115 Audi
        02 1115 Cadillac
        02 1116 BMW
        03 1117 Acura
        03 1117 Audi
        03 1117 BMW
        03 1118 Mercedes", header = TRUE)

require("data.table")
DT <- as.data.table(data)

# Count instances of brand
DT[, count := .N, by="brand"]

# Keep the brand that appears more times by ID
result = DT[, list(max_count = max(count), time), by="brand,ID"]
result

# Drop max_count column by reference in case you want to
set(result, j=c("max_count"), value=NULL)

print(result)
#       brand ID time
# 1:      BMW  1 1111
# 2:      BMW  1 1112
# 3:     Audi  1 1113
# 4: Mercedes  1 1114
# 5:     Audi  2 1115
# 6: Cadillac  2 1115
# 7:      BMW  2 1116
# 8:    Acura  3 1117
# 9:     Audi  3 1117
# 10:      BMW  3 1117
# 11: Mercedes  3 1118
marbel
  • 7,560
  • 6
  • 49
  • 68
0

You could use a table and find those brands that appear in each ID.

> tab <- table(dat$ID, dat$brand)
> names(which(apply(tab, 2, function(x) !any(x == 0))))
## [1] "Audi" "BMW"

For those brands that do not appear in each ID,

> A <- apply(tab, 2, sum)
> A[A < nrow(tab)]
## Acura Cadillac Mercedes 
##     1        1        2 

And to sort them most to least

> sort(A[A < nrow(tab)], decreasing = TRUE)
## Mercedes    Acura Cadillac 
##        2        1        1 
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • Your first answer falls over if say there are 2 BMWs for both ID1 and ID2. I.e. - the sum will be 4, but there will still be one out of the 3 IDs without a BMW. – thelatemail May 02 '14 at 00:39