1

This is the dataset:

company <- c("Coca-Cola Inc.", "DF, CocaCola", 
         "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand  <- c("Coca-Cola Zero","N/A", "Coca-Cola", "Pepsi", "soft drink")
vol  <- c("2456","1653", "19", "2766", "167")
data   <-data.frame(company, brand, vol)
data

Which results in:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA          CocaCola     19
4           PepsiCo Inc.             Pepsi   2766
5 Beverages Distribution        soft drink    167

Let's say, this is imported volume by brand.

The task is to SUBSET the dataframe to only see observations related to Coca-Cola, not any other brand.

  • The problem is that Coca-Cola is written in many different ways.
  • Also, we know that Beverages Distribution company only imports Coca-cola, even though it is not indicated in the table above.

We need to partially match COMPANY and BRAND variables against a list of criteria (keys):

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

I am struggling to execute this idea:

SUBSET data IF brand PARTIALLY MATCHES ANY key from brand_key vector OR company PARTIALLY MATCHES ANY key from company_key

So, leave only the lines in which :

(brand observation partially matches "coca-" OR "cocacola" OR "coca cola")

OR

(company observation partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")

Note: Needs to be NOT case-sensitive

The desirable output:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA          CocaCola     19
4 Beverages Distribution        soft drink    167

Any ideas? Thanks in advance :)

denisafonin
  • 1,116
  • 1
  • 7
  • 16
  • 1
    for partial matching use `grepl` for ingore case use build in `ignore.case=T` or add `"(?i)"` in your regEx pattern. Use `paste` with `collapse="|"` for your key-vectors. – Andre Elrico Jul 20 '18 at 13:41
  • @Parfait sorry for the wrong phrasing. This is exactly what I need, a help to figure out a task i have at my hands – denisafonin Jul 20 '18 at 13:43

2 Answers2

5

Using regex and its | (or) operator. Parameter ignore.case deals with the case.

index <- grepl(paste0(company_key, collapse = "|"), data$company, ignore.case = TRUE) |
    grepl(paste0(brand_key, collapse = "|"), data$company, ignore.case = TRUE)

data[index,]  

#                 company          brand  vol
#1         Coca-Cola Inc. Coca-Cola Zero 2456
#2           DF, CocaCola            N/A 1653
#3              COCA-COLA      Coca-Cola   19
#5 Beverages Distribution     soft drink  167
Ape
  • 1,159
  • 6
  • 11
1

Considering that coca can be followed by either a dash or a cola preceded by optional spaces. I paste both columns together for the coca search and make a different test for Beverage Distribution

data[grepl("coca-|(\\s*cola)", paste(data[,1], data[,2]), ignore.case = T) |
       grepl("Beverages Distribution",data[,1]),]
#                  company          brand  vol
# 1         Coca-Cola Inc. Coca-Cola Zero 2456
# 2           DF, CocaCola            N/A 1653
# 3              COCA-COLA      Coca-Cola   19
# 5 Beverages Distribution     soft drink  167

If Beverage Distribution can only be a complete match you may want to change the second part to data[,1] == "Beverages Distribution"

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167