This is the dataset:
company <- c("Coca-Cola Inc.", "DF, CocaCola",
"COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero","N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456","1653", "19", "2766", "167")
data <-data.frame(company, brand, vol)
data
Which results in:
company brand vol
1 Coca-Cola Inc. Coca-Cola Zero 2456
2 DF, CocaCola N/A 1653
3 COCA-COLA CocaCola 19
4 PepsiCo Inc. Pepsi 2766
5 Beverages Distribution soft drink 167
Let's say, this is imported volume by brand.
The task is to SUBSET the dataframe to only see observations related to Coca-Cola, not any other brand.
- The problem is that Coca-Cola is written in many different ways.
- Also, we know that Beverages Distribution company only imports Coca-cola, even though it is not indicated in the table above.
We need to partially match COMPANY and BRAND variables against a list of criteria (keys):
company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")
I am struggling to execute this idea:
SUBSET data IF brand PARTIALLY MATCHES ANY key from brand_key vector OR company PARTIALLY MATCHES ANY key from company_key
So, leave only the lines in which :
(brand observation partially matches "coca-" OR "cocacola" OR "coca cola")
OR
(company observation partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")
Note: Needs to be NOT case-sensitive
The desirable output:
company brand vol
1 Coca-Cola Inc. Coca-Cola Zero 2456
2 DF, CocaCola N/A 1653
3 COCA-COLA CocaCola 19
4 Beverages Distribution soft drink 167
Any ideas? Thanks in advance :)