I need to replace parts of a string with replacements that are stored in a data frame.

For example -

input_string = "Whats your name and Where're you from"

I need to replace parts of this string using a data frame of from/to pairs. Say the data frame is

matching <- data.frame(from_word=c("Whats your name", "name", "fro"),
            to_word=c("what is your name","names","froth"))

Expected output: what is your name and Where're you from

Note -

  1. The longest match should win. In this example, name is not replaced with names, because name was part of a bigger match.
  2. Only whole words or phrases should match, not partial strings. The fro in "from" should not be replaced with "froth".

I referred to the question linked below but could not get it to work as described above:

Match and replace multiple strings in a vector of text without looping in R

This is my first post here. If I haven't given enough details, kindly let me know.

Sri

3 Answers

1
toreplace = list("x1" = "y1", "x2" = "y2", ..., "xn" = "yn")

Each list entry is a pair of xi and yi:

xi is the pattern (find what),
yi is the replacement (replace with).

library(gsubfn)  # provides gsubfn()
input_string = "Whats your name and Where're you from"
# names are the patterns, values are the replacements
toreplace <- list("Whats your name" = "what is your name", "name" = "names", "fro" = "froth")
gsubfn(paste(names(toreplace), collapse = "|"), toreplace, input_string)
Aleksandr
  • Thanks Aleksandr Voitov for your response. I think the "fro" in "from" is being replaced as well. This is the answer I got: what is your name and Where're you **frothm** – Sri Mar 24 '17 at 12:22
  • No problem @Sri. It's quite useful functionality, and it saves you from calling gsub() multiple times when you want to replace specific strings with something meaningful. – Aleksandr Mar 24 '17 at 12:27
  • Oh sorry, I probably wasn't clear in my comment. In your code, the last **from** should not become **froth**. Is there a way to prevent that, please? – Sri Mar 24 '17 at 12:29
  • In your question you mentioned that you want to replace "fro" with "froth". So what should the real replacement be? – Aleksandr Mar 24 '17 at 12:32
  • Sorry, I had said it should not match. The expected output is **what is your name and Where're you from**. Also, if the matching data frame runs to 2 million rows, will this perform well? – Sri Mar 24 '17 at 12:34
  • I made a change in this answer. You should test and find out, but I suspect it will perform well. – Aleksandr Mar 24 '17 at 12:38
  • Aleksandr, please read my note #2. It should match and replace "fro" with "froth" if the input string contains "fro". But it should not match "fro" partially inside "from" and turn it into "frothm". You have now completely removed that entry from the toreplace list, which isn't correct, as it will no longer replace the word "fro" with "froth" at all. Hope I am clear. – Sri Mar 24 '17 at 12:46
  • Now the output looks like: "what is your name and Where're you frothm" – Aleksandr Mar 24 '17 at 12:48
  • No. It should match "fro" only if the input string contains the complete word "fro", not when it is part of another word as in "from" or "frost" etc. It should be a whole-word match. Your code does a partial match. – Sri Mar 24 '17 at 13:26
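
The partial-match problem discussed in these comments can be illustrated with a small base R sketch. It uses the word-boundary escape `\\b` with `perl = TRUE` rather than gsubfn, and the test vector `words` is made up for illustration; it is not from the thread.

```r
# Illustrative inputs (hypothetical, not from the question)
words <- c("fro", "from", "frost", "to and fro")

# Plain pattern: matches "fro" anywhere, including inside "from" and "frost"
partial <- grepl("fro", words)

# Word-boundary pattern: matches "fro" only as a complete word
whole <- grepl("\\bfro\\b", words, perl = TRUE)

# The same boundaries applied to a replacement
replaced <- gsub("\\bfro\\b", "froth", "to and fro, far from home", perl = TRUE)

print(partial)   # TRUE TRUE TRUE TRUE
print(whole)     # TRUE FALSE FALSE TRUE
print(replaced)  # "to and froth, far from home"
```

With the boundaries in place, "from" is left alone while a standalone "fro" is still replaced, which is the behaviour note 2 of the question asks for.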
1

Edit

Based on the input from Sri's comment I would suggest using:

library(gsubfn)
# words to be replaced
a <-c("Whats your","Whats your name", "name", "fro")
# their replacements
b <- c("What is yours","what is your name","names","froth")
# named list as an input for gsubfn
replacements <- setNames(as.list(b), a)
# the test string
input_string = "fro Whats your name and Where're name you from to and fro I Whats your"
# match entire words
gsubfn(paste(paste0("\\w*", names(replacements), "\\w*"), collapse = "|"), replacements, input_string)

Original

I would not say this is easier to read than your simple loop, but it might take better care of the overlapping replacements:

# define the sample dataset
input_string = "Whats your name and Where're you from"
matching <- data.frame(from_word=c("Whats your name", "name", "fro", "Where're", "Whats"),
                       to_word=c("what is your name","names","froth", "where are", "Whatsup"))

# load used libraries
library(gsubfn)
library(stringr)  # provides str_split()

# make sure data is of class character
matching$from_word <- as.character(matching$from_word)
matching$to_word <- as.character(matching$to_word)

# extract the words in the sentence
test <- unlist(str_split(input_string, " "))
# find where individual words from the sentence match the list of replaceable words
test2 <- sapply(paste0("\\b", test, "\\b"), grepl, matching$from_word)
# change rownames to see what is the format of output from the above sapply
rownames(test2) <- matching$from_word
# reorder the data so that largest replacement blocks are at the top
test3 <- test2[order(rowSums(test2), decreasing = TRUE),]
# where the word is already being replaced by larger chunk, do not replace again
test3[apply(test3, 2, cumsum) > 1] <- FALSE

# define the actual pairs of replacement
replacements <- setNames(as.list(as.character(matching[,2])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1]),
                         as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1])

# perform the replacement
gsubfn(paste(as.character(matching[,1])[order(rowSums(test2), decreasing = TRUE)][rowSums(test3) >= 1], collapse = "|"),
       replacements,input_string)
ira
  • Thanks @ira. Two points I noticed in your code: 1. I tested with a matching data frame of 4500+ rows. My loop executed in 0.2 seconds and the above code took 0.4 seconds. 2. I think your code expects the string in the same order as that of a. For instance, if I give the input string as "to and fro is the name", your code replaces fro with froth but doesn't replace name with names. I think it is because of the ordering? I am not sure. – Sri Mar 24 '17 at 17:03
  • @Sri the error is because in the code I did not bother to verify whether the entire pattern is matched or just part of it. But I have now suggested a much more elegant way in line with Aleksandr Voitov's answer, which should fix the issue from his answer. – ira Mar 24 '17 at 20:03
  • Thank you @ira. It works. I must add, though, that it works when the replacement "from" and "to" vectors have a small number of rows. When I tried with a replacement table of 60,000 rows, it errored out, unable to compile the regular expression. So for now I am going to go ahead with my own solution using the loop, until someone makes it better (by removing the loop). – Sri Mar 25 '17 at 07:14
  • @Sri I see... in that case bear in mind that your loop depends on the ordering of the a vector: the largest replacements must come first, otherwise it also won't replace correctly. – ira Mar 25 '17 at 08:00
  • Suggest you use the boundary match `\\b` instead of `\\w*` – G. Grothendieck Mar 26 '17 at 13:55
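
The point about ordering in these comments can be sketched in base R. With `perl = TRUE`, alternatives in a pattern are tried left to right, so a shorter pattern listed first wins over a longer one that starts at the same position. The variable names below (`unsorted`, `longest_first`, `first_hit`, `longest_hit`) are illustrative only.

```r
x <- "Whats your name"

# Shorter pattern listed first, as in the vector a above
unsorted <- c("Whats your", "Whats your name")

# Sort patterns so the longest comes first
longest_first <- unsorted[order(nchar(unsorted), decreasing = TRUE)]

# Extract the first match for each ordering (PCRE alternation is ordered)
first_hit <- regmatches(x, regexpr(paste(unsorted, collapse = "|"), x, perl = TRUE))
longest_hit <- regmatches(x, regexpr(paste(longest_first, collapse = "|"), x, perl = TRUE))

print(first_hit)    # "Whats your"
print(longest_hit)  # "Whats your name"
```

Sorting by `nchar()` before building the alternation is one simple way to get the "maximum string" behaviour without relying on the order in which the patterns happen to be stored.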
0

I was trying out different things, and the code below seems to work.

a <-c("Whats your name", "name", "fro")
b <- c("what is your name","names","froth")
c <- c("Whats your name and Where're you from")

for(i in seq_along(a)) c <- gsub(paste0('\\<',a[i],'\\>'), gsub(" ","_",b[i]), c)
c <- gsub("_"," ",c)
c

I took help from this question: Making gsub only replace entire words?

However, I would like to avoid the loop if possible. Can someone please improve this answer so that it works without the loop?

Sri
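
One possible loop-free take on the approach above, offered as a minimal sketch rather than a tested solution: build a single alternation sorted longest-first, wrap it in word boundaries, and replace all matches in one pass with `gregexpr()` and `regmatches<-`. The helper name `replace_whole_words` is hypothetical. The sketch assumes every pattern starts and ends with a word character (so `\\b` behaves sensibly) and that no pattern contains the literal sequence `\E`. It has not been checked against tables of tens of thousands of rows, where a single large pattern may still fail to compile.

```r
# Hypothetical helper (not from the thread): replace whole-word matches,
# preferring the longest pattern, in one vectorised pass.
replace_whole_words <- function(x, from, to) {
  from <- as.character(from)  # data frame columns may be factors
  to <- as.character(to)
  # Longest patterns first: PCRE tries alternatives left to right
  ord <- order(nchar(from), decreasing = TRUE)
  lookup <- setNames(to[ord], from[ord])
  # \Q...\E quotes each pattern literally; \b anchors to word boundaries
  alternatives <- paste0("\\Q", names(lookup), "\\E", collapse = "|")
  pattern <- paste0("\\b(?:", alternatives, ")\\b")
  m <- gregexpr(pattern, x, perl = TRUE)
  # Swap every matched substring for its replacement via the lookup table
  regmatches(x, m) <- lapply(regmatches(x, m), function(hit) unname(lookup[hit]))
  x
}

matching <- data.frame(
  from_word = c("Whats your name", "name", "fro"),
  to_word = c("what is your name", "names", "froth"),
  stringsAsFactors = FALSE
)

out <- replace_whole_words(
  c("Whats your name and Where're you from", "to and fro is the name"),
  matching$from_word, matching$to_word
)
print(out)
# "what is your name and Where're you from" "to and froth is the names"
```

Because each matched span is replaced exactly once, a replacement such as "what is your name" is never rescanned, so the underscore trick from the loop version is not needed here.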