
I am looking for a simplified solution to the following problem in R: I have a list of names that are separated by commas. However, some of the names also have commas in them. In order to separate the names, I would like to first replace every name that contains a comma with a comma-free version and then split by comma. My problem is that I have around 26,000 strings with several names in each, and a list of around 130 names with commas. I have written a nested foreach loop (in order to use multiple cores to speed things up). It works, but it's horribly slow. Is there a quicker way to search in the strings and replace the relevant names? Here is my sample code:

List_of_names<-as.data.frame(c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike","Digital, Mike, John, Sr","Svenja, Sven"))
Comma_names<-as.data.frame(c("Franz, Jr.","Nice, LLC","John, Sr"))
colnames(Comma_names)<-"name"
Comma_names$replace_names<-gsub(",", "",Comma_names[,"name"])

library(doParallel)
library(foreach)
cl<-makeCluster(4) # Create cluster with desired number of cores
registerDoParallel(cl) # Register cluster


names_new<-foreach (i=1:nrow(List_of_names),.errorhandling="pass",.packages=c("foreach")) %dopar% {
  name_2<-List_of_names[i,]
  foreach (j=1:nrow(Comma_names),.combine=rbind,.errorhandling="pass") %do% {
    if(length(grep(Comma_names[j,1],name_2))>0){
      name_2<-gsub(Comma_names[j,1], Comma_names[j,2],name_2)
    }
  }
  name_2
}

In addition, the result of the foreach loop is a list but if I try to save the list or replace the column in my original dataframe it takes forever. How can I change my code to make it faster?

Thank you to everyone who reads this and is able to help!

Fred
  • Possible duplicate of [Replace multiple strings in one gsub() or chartr() statement in R?](https://stackoverflow.com/questions/33949945/replace-multiple-strings-in-one-gsub-or-chartr-statement-in-r) – ismirsehregal Jul 24 '19 at 09:13

1 Answer


Principle

You can use a combination of Reduce and stri_replace_all from the stringi package.

Code

library(stringi)
Comma_names <- structure(list(name = c("Franz, Jr.", "Nice, LLC", "John, Sr"), 
                              replace_names = c("Franz Jr.", "Nice LLC", "John Sr")), 
                              .Names = c("name", "replace_names"), 
                              row.names = c(NA, -3L), class = "data.frame")


List_of_names <- structure(list(name = c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
                                         "Digital, Mike, John, Sr", "Svenja, Sven")), 
                                .Names = "name", 
                                row.names = c(NA, -3L), class = "data.frame")

wrapper <- function(str, ind) stri_replace_all(str, Comma_names$replace_names[ind], 
                                               fixed = Comma_names$name[ind])

ind <- 1:NROW(Comma_names)
Reduce(wrapper, ind, init = List_of_names$name)
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"                 
# [3] "Svenja, Sven" 
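
The question's final goal was to split the cleaned strings by comma. A minimal sketch of that last step in base R, assuming the replacement above has already been applied (the variable names `cleaned` and `split_names` are only illustrative and not part of the original code):

```r
# Illustrative input: the strings as they look after the comma names were replaced
cleaned <- c("Fred, Heiko, Franz Jr., Nice LLC, Meike",
             "Digital, Mike, John Sr",
             "Svenja, Sven")

# Split on a comma followed by optional whitespace;
# the result is a list with one character vector per input string
split_names <- strsplit(cleaned, ",\\s*")

split_names[[1]]
# [1] "Fred"      "Heiko"     "Franz Jr." "Nice LLC"  "Meike"

# Number of names found in each string
lengths(split_names)
# [1] 5 3 2
```

If the result has to go back into a data frame, a list column (or one row per name) is probably easier to handle than a re-joined string, but that depends on what you do with it afterwards.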

Explanation

stri_replace_all is a fast function that replaces all occurrences of a pattern in a string. With Reduce, you apply a function to the result of the previous function call. So we apply wrapper to the column with all the names and replace the string in the first row of Comma_names. We then feed the result to wrapper again, this time to replace all occurrences of the second row, and so on. This code should run reasonably fast, so you do not need to parallelize. I would be curious to hear your feedback on the execution time.
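
To make the Reduce step more concrete, here is a rough sketch that unrolls it into an explicit loop. It uses base R's gsub with fixed = TRUE instead of stringi so that it runs without extra packages. The names `names_vec`, `patterns`, `replacements` and `step` are made up for this illustration:

```r
# Toy data mirroring the example (illustrative names, not from the original code)
names_vec <- c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
               "Digital, Mike, John, Sr",
               "Svenja, Sven")
patterns     <- c("Franz, Jr.", "Nice, LLC", "John, Sr")
replacements <- c("Franz Jr.", "Nice LLC", "John Sr")

# Explicit loop: each pass rewrites the whole vector for one pattern,
# and the next pass starts from the result of the previous one
result_loop <- names_vec
for (k in seq_along(patterns)) {
  result_loop <- gsub(patterns[k], replacements[k], result_loop, fixed = TRUE)
}

# The same computation with Reduce: the accumulator is the vector of strings,
# and each element of seq_along(patterns) selects the next pattern to replace
step <- function(acc, k) gsub(patterns[k], replacements[k], acc, fixed = TRUE)
result_reduce <- Reduce(step, seq_along(patterns), names_vec)

identical(result_loop, result_reduce)
# [1] TRUE
```

The point is that the loop runs over the roughly 130 patterns, not over the 26,000 strings. Each pass is one vectorized call over all strings, which is most likely where the speed-up compared to the nested foreach comes from.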

Benchmark

Just a little benchmark with 3 million lines:

List_of_names <- List_of_names[rep(1:NROW(List_of_names), 1e6), , drop = FALSE]
system.time(invisible(Reduce(wrapper, ind, init = List_of_names$name)))
# user  system elapsed 
# 1.95    0.00    1.96
thothal
  • Amazing! Thank you so much! With my 2 foreach loops it took: user 132.39, system 106.37, elapsed 728.22, whereas your solution took less than a second! Thanks again! Such a great answer! – Fred Mar 04 '16 at 14:19
  • Ditch wrapper & Reduce & instead just do: `List_of_names$name = stri_replace_all_fixed(List_of_names$name, Comma_names[,1], Comma_names[,2], vectorize_all=T)` – webb Mar 07 '16 at 05:39
  • @webb actually, that would not work, because then you will replace the first name in the first row, the second name in the second row and so on. It was also just a coincidence that the line did not give a warning with the current example, as both data frames contained 3 lines. Try `stri_replace_all_fixed(List_of_names$name[c(1, 2, 2, 3)], Comma_names[,1], Comma_names[,2], vectorize_all = TRUE)` and you will see the warning. – thothal Mar 07 '16 at 09:19
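
For completeness, a hedged sketch of the one-call variant discussed in these comments, but with vectorize_all = FALSE. As far as I understand the stringi documentation, this setting applies every pattern in turn to every string instead of pairing string i with pattern i, so the lengths do not have to match. The variable names are illustrative and the stringi package is assumed to be installed:

```r
library(stringi)

# Illustrative data; four strings on purpose, so the lengths differ from the 3 patterns
names_vec <- c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
               "Digital, Mike, John, Sr",
               "Digital, Mike, John, Sr",
               "Svenja, Sven")
patterns     <- c("Franz, Jr.", "Nice, LLC", "John, Sr")
replacements <- c("Franz Jr.", "Nice LLC", "John Sr")

# vectorize_all = FALSE: all patterns are applied one after another to each string
res <- stri_replace_all_fixed(names_vec, patterns, replacements,
                              vectorize_all = FALSE)
res
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"
# [3] "Digital, Mike, John Sr"
# [4] "Svenja, Sven"
```

I have not benchmarked this against the Reduce version on the full data, so treat it as an alternative to try, not as a guaranteed improvement.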