Correct the misspelling for the part that is misspelled (even if the part is within a whole word)

Question

I want to correct a misspelling by replacing only the misspelled part, even when that part sits inside a longer word. Here is some example code. The first bit sets up a reference data frame with the wrong and correct spellings.

library(stringr)
corrected <- data.frame(stringsAsFactors=FALSE,
                        Wrong_spell = c("abdmen", "abdomane", "abdome", "abdumen", "abodmen",
                                        "adnomen", "aabdominal", "abddominal"),
                        Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", "abdomen",
                                          "abdomen", "abdominal", "abdominal") )

These are the elements to be corrected:

reported <- c("abdmen pain", "abdomane pain", "abdumenXXX pain")

When I run this code, I get the following 3 elements:

regex_pattern <- setNames(corrected$Correct_spell, paste0("\\b", corrected$Wrong_spell, "\\b"))
str_replace_all(reported, regex_pattern)

> str_replace_all(reported, regex_pattern)
[1] "abdomen pain"    "abdomen pain"    "abdumenXXX pain"

I would like the code to replace just the part that matches the misspelling, so that the third element becomes "abdomenXXX pain". It corrected the first two, but the third element is unchanged, because the "\\b" word boundaries mean the code only matches whole words within each element. I'm not sure it's possible, but if you have any ideas or potential fixes, please point me to where I need to look. Any help is greatly appreciated. Thanks in advance.

You might take a look here: https://stackoverflow.com/a/47642123/15293191 — AndrewGB, May 02 '22 at 04:17

jpsmith · Answer 1 · 2022-05-02T03:37:38.533


The following works on your example data, but I'm not sure whether it will work for your real dataset. Since the only difference between "abdomen" and "abdominal" is "al", I just checked for any of the wrong spellings from corrected$Wrong_spell (minus the "al" in abdominal) and replaced each match with "abdomen":

str_replace_all(reported, 
                paste(gsub("al", "",corrected$Wrong_spell), collapse ="|"), 
                "abdomen")

Output:

[1] "abdomen pain"    "abdomen pain"    "abdomenXXX pain"

edited May 02 '22 at 03:37

answered May 02 '22 at 02:48

jpsmith


Thanks for the effort. You are right, it's not exactly what I was looking for, as it may only work for this case, but it does give me some ideas for how I may get there. Many thanks – H.Cheung May 02 '22 at 04:24
Yeah, I figured it probably wasn't generalizable. FYI, I also replaced "\\b" with "(?!s)" in the regex and it worked, but it added an extra "n" to the first two elements. Very unclear why, but something to look into as well. Good luck! – jpsmith May 02 '22 at 04:29

Yeah I had noticed. I was hoping some magical function could do it all. It looks like I'm going to be writing an inelegant loop of some sort lol. Thanks again! – H.Cheung May 02 '22 at 04:36
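
The extra "n" mentioned in the comments above can be reproduced in isolation. This is a minimal sketch, assuming that the named-vector form of str_replace_all applies its patterns one after another; a base R gsub loop stands in for it here, and the two-entry wrong/right vectors are a cut-down illustration rather than the full table. Once "abdmen" has been corrected to "abdomen", the shorter entry "abdome" matches inside the corrected word and is expanded a second time.

```r
# Cut-down dictionary (illustrative subset of the table in the question)
wrong <- c("abdmen", "abdome")
right <- c("abdomen", "abdomen")

x <- "abdmen pain"
for (k in seq_along(wrong)) {
  # Each pattern is applied to the result of the previous replacement
  x <- gsub(wrong[k], right[k], x, fixed = TRUE)
}
x
# [1] "abdomenn pain"
```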

score 1 · Answer 2 · answered May 02 '22 at 05:53

You could use regexpr in outer.

f <- \(x, y) {
    s <- strsplit(x, '\\s+')
    k <- outer(y[, 1], sapply(s, `[`, 1), Vectorize(regexpr))
    j <- which(colSums(k == 1) == 1)
    i <- apply(k[, j], 2, which.max)
    s[j] <- Map(`[<-`, s[j], 1, y[i, 2])
    vapply(s, paste, collapse=' ', character(1))
}


f(reported, corrected)
# [1] "abdomen pain1"   "abdominal pain2" "abdomen pain3"   "abdomen pain4"   "abdominal pain5"

*Data:*

corrected <- structure(list(Wrong_spell = c("abdmen", "abdomane", "abdome", 
"abdumen", "abodmen", "adnomen", "aabdominal", "abddominal"), 
    Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", 
    "abdomen", "abdomen", "abdominal", "abdominal")), class = "data.frame", row.names = c(NA, 
-8L))

reported <- c("abdmen pain1", "abddominal pain2", "abdumenXXX pain3", "abdomen pain4", 
"abddominal pain5")
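
To make the outer/regexpr step easier to follow, here is a small sketch of the matrix it builds. The two misspellings and three words below are illustrative values chosen for this example, not the full data; the variable names wrong and first_words are likewise assumptions. Each cell holds the position at which a misspelling starts inside a word, or -1 when it does not occur, so a value of 1 means the word begins with that misspelling.

```r
# Two misspellings and three first words (illustrative values only)
wrong <- c("abdmen", "abdumen")
first_words <- c("abdmen", "abdumenXXX", "pain")

# k[i, j] is the position where wrong[i] starts in first_words[j], or -1 if absent
k <- outer(wrong, first_words, Vectorize(regexpr))
k

# Columns (words) in which exactly one misspelling starts at position 1
which(colSums(k == 1) == 1)
# [1] 1 2
```

Because the function then overwrites the whole first word with the correct spelling, anything after the misspelled part (such as "XXX") is lost.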

This is brilliant and very useful. If it's OK to ask for an adjustment: is there a change to the code that would turn the third element into "abdomenXXX pain3"? The code you wrote removed the XXX part, whereas I'd like to replace only the incorrect bit of text. If not, thanks ever so much for this. H – H.Cheung, May 02 '22 at 16:09

H.Cheung · Answer 3 · 2022-05-02T19:11:54.017

I managed to do it, with some inspiration from the other posters. But did so with an inelegant loop.

# Data
corrected <- structure(list(Wrong_spell = c("abdmen", "abdomane", "abdome", 
                                            "abdumen", "abodmen", "adnomen", "aabdominal", "abddominal"), 
                            Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", 
                                              "abdomen", "abdomen", "abdominal", "abdominal")), class = "data.frame", row.names = c(NA, -8L))

reported <- c("abdmen pain1", "abddominal pain2", "abdumenXXX pain3", "abdomen pain4", "abddominal pain5",  "pain6 aabdominal")

# Loop over every reported string and every known misspelling
library(stringr)
reported_correct <- NULL
for (i in seq_along(reported)) {
  for (j in seq_len(nrow(corrected))) {
    # Only strings that contain the misspelling are corrected and kept
    if (grepl(corrected[j, 1], reported[i])) {
      change <- str_replace_all(reported[i], corrected[j, 1], corrected[j, 2])
      reported_correct <- c(reported_correct, change)
    }
  }
}

reported_correct
> reported_correct
[1] "abdomen pain1"    "abdominal pain2"  "abdomenXXX pain3" "abdomenn pain4"   "abdominal pain5"  "pain6 abdominal"

Indeed, it's not pretty, but it does the trick. I'm going to post this in another question to see if it can be done in a quicker manner, e.g. for longer lists. Thanks

EDIT: I can see why this doesn't work and that my original question is flawed. Any misspelling that is a substring of the correct word will also match inside the correct word, which then becomes incorrect. For example, the correct "abdomen pain4" became "abdomenn pain4", because the misspelling "abdome" matches inside "abdomen". In theory I did what I wanted, but to make this work I'd need to remove misspellings that are substrings of the correct spelling.
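
One possible way around the substring problem described in the EDIT, offered as a sketch rather than a definitive fix: do a single pass with one alternation that also contains the correct spellings (mapped to themselves) and tries longer entries first, so "abdomen" is matched as a whole before "abdome" gets a chance. The function name fix_spelling is hypothetical, base R is used instead of stringr, and the sketch assumes the dictionary entries contain only letters (no regex metacharacters).

```r
corrected <- data.frame(
  Wrong_spell   = c("abdmen", "abdomane", "abdome", "abdumen", "abodmen",
                    "adnomen", "aabdominal", "abddominal"),
  Correct_spell = c("abdomen", "abdomen", "abdomen", "abdomen", "abdomen",
                    "abdomen", "abdominal", "abdominal"),
  stringsAsFactors = FALSE
)
reported <- c("abdmen pain1", "abddominal pain2", "abdumenXXX pain3",
              "abdomen pain4", "abddominal pain5", "pain6 aabdominal")

# Hypothetical helper: single-pass replacement, longest entries first
fix_spelling <- function(x, dict) {
  good <- unique(dict$Correct_spell)
  # Misspellings map to their correction; correct words map to themselves
  lookup <- setNames(c(dict$Correct_spell, good), c(dict$Wrong_spell, good))
  keys <- names(lookup)[order(nchar(names(lookup)), decreasing = TRUE)]
  m <- gregexpr(paste(keys, collapse = "|"), x, perl = TRUE)
  # Swap each matched piece for its looked-up value; the rest is untouched
  regmatches(x, m) <- lapply(regmatches(x, m), function(hit) unname(lookup[hit]))
  x
}

result <- fix_spelling(reported, corrected)
result
```

Because every string is scanned only once, a corrected word is never fed back into the matcher, and text around the misspelling (such as "XXX") is left in place.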
