2

I am working with R and I have two data frames. One data frame my_data is my main dataset that contains order data, the other one, word_list, contains a list of words that I would like to match with my_data.

Here is a reproducible example of the two data frames:

my_data <- data.frame(
  Order = c("1","2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123", "PACK13, 1xML680, 1x2304TR", "GF123"))

word_list <- data.frame(
  Codes = c("TS678","AB123", "GF123", "CC756"),
  Product_Category = c("Apple", "Apple", "Orange", "Orange"))

What I would like to do is to match the Product_ID in my_data with the Codes in word_list and add a new column to my_data with the matching Product_Category from word_list.
However, I need to implement exact matches as well as consider Code combinations (as seen with "PACK" in the sample data, which consists of multiple product codes in one column)

For the final dataframe I want to end up with the following:

  1. Match the exact matches -> add the corresponding Product_Category, e.g. "Apple"
  2. Match columns that contain the code from word_list, but also contain other codes. Certain products are Packs and the ID is mixed with other IDs -> this should result "Apple + Other" if the code for "apple" is contained plus other codes are contained. Another issue here is that the Code that needs to be matched is also accompanied by a count (e.g., PACK12 includes 1x GF123, 1xML680, etc.)
  3. All columns that do not contain the exact match nor a mixed match should be assigned "Other"

To make it better understandable, what I would like to get as a final result is a dataframe that looks like the following:

 my_data_result <- data.frame(
  Order = c("1","2", "3", "4", "5", "6"),
  Product_ID = c("TS678", "AB123", "PACK12, 1xGF123, 1xML680", "AB123", "PACK13, 1xML680, 1x2304TR", "GF123"),
  Product_Category = c("Apple", "Apple", "Orange + Other", "Apple", "Other", "Orange"))

I assume this could be done with regex & gsub, but I am not sure how.

Thank you!

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
emil_rore
  • 115
  • 1
  • 7
  • Just `merge()`... – Sotos Jul 29 '20 at 12:15
  • Hi @Sotos! I am not sure, but as far as I understand, "merge" would not consider mixed matches, so columns where I have more than just one Product ID - like., "PACK12, 1xGF123, 1xML680" in the above example. – emil_rore Jul 29 '20 at 12:18
  • Apologies. I did not catch that. – Sotos Jul 29 '20 at 12:20
  • @Sotos, maybe one more thing to add: I also do not have ALL product codes in my matching list, but only the codes for "Apple" and "Orange". All other product codes are irrelevant, but I need to consider that they sometimes co-occur with Apple and Orange in one column. – emil_rore Jul 29 '20 at 12:25

2 Answers2

1

Since your data is huge you can try this data.table approach :

library(data.table)
library(splitstackshape)

#Convert to data.table
setDT(my_data)
setDT(word_list)

#Get the data in long format
df1 <- cSplit(my_data, 'Product_ID', direction = 'long')
#Remove initial characters 
df1[, Product_ID := sub('.*x', '', Product_ID)]

#Join the dataframes
df1 <- merge(df1, word_list, by.x = 'Product_ID', by.y = 'Codes', all.x = TRUE)
#Replace NA with "Other"
df1[, Product_Category := replace(Product_Category, 
                           is.na(Product_Category), 'Other')]

#Combine the values by Order
df1[, .(Product_ID = toString(Product_ID), 
       Product_Category = paste(sort(unique(Product_Category)), 
                          collapse = " + ")), Order]

#   Order            Product_ID Product_Category
#1:     5 2304TR, ML680, PACK13            Other
#2:     2                 AB123            Apple
#3:     4                 AB123            Apple
#4:     3  GF123, ML680, PACK12   Orange + Other
#5:     6                 GF123           Orange
#6:     1                 TS678            Apple
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Hey @Ronak! Thank you very much for your answer! I have tried it and the code exectues fast. However, what I have noticed is that it gives me different categories: I get "Orange + Other" as well as "Other + Orange" in the final result. Is there a way to avoid this? Thank you! – emil_rore Jul 30 '20 at 09:22
  • @emil_rore Yes, you can `sort` it so the order is always the same. I have updated the answer. – Ronak Shah Jul 30 '20 at 09:34
  • I also notices that it makes combinations such as "Apple + Orange + Other", "Other + Orange + Apple", "Orange + Apple + Other", etc., although they are all the same category. I have mapped them manually. Howevery, do you think there is a way to avoid this? Thanks again! – emil_rore Jul 30 '20 at 10:14
  • Hmmm...that's strange. It should not happen since we are using `sort` here. It should always give the correct order. Any way to reproduce it? – Ronak Shah Jul 30 '20 at 11:18
0

Here is an idea using dplyr and tidyr. We split the rows to long, clean the codes, match with word_list and convert back to strings per Order, i.e.

library(dplyr)
library(tidyr)

my_data %>% 
 separate_rows(Product_ID, sep = ', ') %>% 
 mutate(Product_ID = sub('.*x', '', Product_ID), 
        Product_Category = as.character(word_list$Product_Category[match(Product_ID, word_list$Codes)]), 
        Product_Category = replace(Product_Category, is.na(Product_Category), 'Other')) %>%
 group_by(Order) %>% 
 summarise_all(list(~toString(unique(.))))

# A tibble: 6 x 3
#  Order Product_ID            Product_Category
#  <fct> <chr>                 <chr>           
#1 1     TS678                 Apple           
#2 2     AB123                 Apple           
#3 3     PACK12, GF123, ML680  Other, Orange   
#4 4     AB123                 Apple           
#5 5     PACK13, ML680, 2304TR Other           
#6 6     GF123                 Orange
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • Hi @Sotos! Thank you so much for your answer! I am trying this code, however, the code is already running for one hour and I am not sure what the problem is. My dataset is pretty large (900k observations) , so this might be an issue. Is there any solution that I could do to process the data faster? It would also be okay to do it in multiple steps and/or overwrite "Product_ID" in the final table. Thank you! – emil_rore Jul 29 '20 at 17:36
  • 900K is not much. It shouldn't be a problem. Try restarting your session – Sotos Jul 30 '20 at 06:15
  • Hey @Sotos,Yesterday, after running it for more than an hour, it didn't seem to merge the Product_Categories. Thus, I have restarted the R session and also exported a file where rows are already separated so that the first "separate_rows" function is not needed (to save time). I have no clue why, but for some reason, the code still does not seem progess and takes ages. – emil_rore Jul 30 '20 at 09:21