I am working with R and I have two data frames. One data frame, my_data, is my main dataset and contains order data. The other one, word_list, contains a list of words that I would like to match against my_data.
Here is a reproducible example of the two data frames:
my_data <- data.frame(
  Order = c("1", "2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123", "PACK13, 1xML680, 1x2304TR", "GF123"))
word_list <- data.frame(
  Codes = c("TS678", "AB123", "GF123", "CC756"),
  Product_Category = c("Apple", "Apple", "Orange", "Orange"))
What I would like to do is match the Product_ID in my_data with the Codes in word_list and add a new column to my_data with the matching Product_Category from word_list.
However, I need to handle exact matches as well as code combinations (as seen with "PACK" in the sample data, where a single Product_ID value consists of multiple product codes).
For the final data frame, I want the following:
- Exact matches: add the corresponding Product_Category, e.g. "Apple".
- Mixed matches: rows whose Product_ID contains a code from word_list but also contains other codes. Certain products are packs, so the ID is mixed with other IDs. This should result in "Apple + Other" if a code for "Apple" is present along with other codes. Another issue here is that the code to be matched is also accompanied by a count (e.g., PACK12 includes 1xGF123, 1xML680, etc.).
- All rows that contain neither an exact match nor a mixed match should be assigned "Other".
To make this clearer, the final result I would like to get is a data frame that looks like the following:
my_data_result <- data.frame(
  Order = c("1", "2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123", "PACK13, 1xML680, 1x2304TR", "GF123"),
  Product_Category = c("Apple", "Apple", "Orange + Other", "Apple", "Other", "Orange"))
I assume this could be done with regex & gsub, but I am not sure how.
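A minimal sketch of one direction this might take in base R, using sub rather than gsub since only a leading count needs removing. It rests on several assumptions that are not confirmed by anything above: codes are separated by commas, a count always looks like digits followed by "x" at the start of a code, pack names such as PACK12 simply count as unmatched codes, and several matched categories in one row are joined with " + ". The helper name classify_product is made up for illustration.

```r
my_data <- data.frame(
  Order = c("1", "2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123",
                 "PACK13, 1xML680, 1x2304TR", "GF123"),
  stringsAsFactors = FALSE
)
word_list <- data.frame(
  Codes = c("TS678", "AB123", "GF123", "CC756"),
  Product_Category = c("Apple", "Apple", "Orange", "Orange"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify a single Product_ID string.
classify_product <- function(id, lookup) {
  # Split on commas and trim surrounding whitespace.
  tokens <- trimws(strsplit(id, ",", fixed = TRUE)[[1]])
  # Assumption: a count is digits followed by "x" at the start, e.g. "1x".
  tokens <- sub("^[0-9]+x\\s*", "", tokens)

  # Whole-token lookup; tokens that are not in the list give NA.
  cats <- lookup$Product_Category[match(tokens, lookup$Codes)]
  matched <- unique(cats[!is.na(cats)])

  # Nothing matched at all: plain "Other".
  if (length(matched) == 0) return("Other")
  # Some tokens matched and some did not: append "Other".
  if (anyNA(cats)) matched <- c(matched, "Other")
  paste(matched, collapse = " + ")
}

my_data$Product_Category <- vapply(
  my_data$Product_ID, classify_product, character(1),
  lookup = word_list, USE.NAMES = FALSE
)
```

Because the lookup compares whole codes with match() instead of searching for substrings, a code like "AB1234" would not be mistaken for "AB123". The count pattern would, however, also strip the start of a real code that happens to begin with digits followed by "x", so it may need tightening for the full data.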
Thank you!