Subset a df using partial match with multiple criteria

Question

This is the dataset:

company <- c("Coca-Cola Inc.", "DF, CocaCola", 
         "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand  <- c("Coca-Cola Zero","N/A", "Coca-Cola", "Pepsi", "soft drink")
vol  <- c("2456","1653", "19", "2766", "167")
data   <-data.frame(company, brand, vol)
data

Which results in:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA          CocaCola     19
4           PepsiCo Inc.             Pepsi   2766
5 Beverages Distribution        soft drink    167

Let's say, this is imported volume by brand.

The task is to SUBSET the dataframe to only see observations related to Coca-Cola, not any other brand.

The problem is that Coca-Cola is written in many different ways.
Also, we know that Beverages Distribution company only imports Coca-cola, even though it is not indicated in the table above.

We need to partially match COMPANY and BRAND variables against a list of criteria (keys):

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

I am struggling to execute this idea:

SUBSET data IF brand PARTIALLY MATCHES ANY key from brand_key vector OR company PARTIALLY MATCHES ANY key from company_key

So, leave only the lines in which :

(brand observation partially matches "coca-" OR "cocacola" OR "coca cola")

OR

(company observation partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")

Note: Needs to be NOT case-sensitive

The desirable output:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA          CocaCola     19
4 Beverages Distribution        soft drink    167

Any ideas? Thanks in advance :)

for partial matching use `grepl` for ingore case use build in `ignore.case=T` or add `"(?i)"` in your regEx pattern. Use `paste` with `collapse="|"` for your key-vectors. — Andre Elrico, Jul 20 '18 at 13:41
@Parfait sorry for the wrong phrasing. This is exactly what I need, a help to figure out a task i have at my hands — denisafonin, Jul 20 '18 at 13:43

Ape · Accepted Answer · 2018-07-20T13:47:06.330

5

Using regex and its | (or) operator. Parameter ignore.case deals with the case.

index <- grepl(paste0(company_key, collapse = "|"), data$company, ignore.case = TRUE) |
    grepl(paste0(brand_key, collapse = "|"), data$company, ignore.case = TRUE)

data[index,]  

#                 company          brand  vol
#1         Coca-Cola Inc. Coca-Cola Zero 2456
#2           DF, CocaCola            N/A 1653
#3              COCA-COLA      Coca-Cola   19
#5 Beverages Distribution     soft drink  167

edited Jul 20 '18 at 13:47

answered Jul 20 '18 at 13:36

Ape

1,159
6
11

1

very good. See my comment for the ignore case thing. This could be improved. – Andre Elrico Jul 20 '18 at 13:43
1

improved means more general. If your keys contains a single UPPER LETTER this solution will not work anymore. – Andre Elrico Jul 20 '18 at 13:45
1

@AndreElrico yep, thanks! I figured that out. A more general solution would be to add the ignore.case criteria then – denisafonin Jul 20 '18 at 13:48

score 1 · Answer 2 · answered Jul 20 '18 at 14:14

Considering that coca can be followed by either a dash or a cola preceded by optional spaces. I paste both columns together for the coca search and make a different test for Beverage Distribution

data[grepl("coca-|(\\s*cola)", paste(data[,1], data[,2]), ignore.case = T) |
       grepl("Beverages Distribution",data[,1]),]
#                  company          brand  vol
# 1         Coca-Cola Inc. Coca-Cola Zero 2456
# 2           DF, CocaCola            N/A 1653
# 3              COCA-COLA      Coca-Cola   19
# 5 Beverages Distribution     soft drink  167

If Beverage Distribution can only be a complete match you may want to change the second part to data[,1] == "Beverages Distribution"

Subset a df using partial match with multiple criteria

2 Answers2

Linked