2

I have a list of phrases in which I want to replace certain words with a similar word when they are misspelled.

How can I search a string for a matching word and replace it?

For example:

a1 <- c(" the classroom is ful ")
a2 <- c(" full")

In this case I would replace ful with full in a1.

Max TC
  • 2
    Do you already know how the words are misspelled? – Lamia Dec 04 '17 at 19:53
  • 1
    https://stackoverflow.com/questions/41463365/replace-a-list-of-words-occuring-in-sentences-in-r – M-- Dec 04 '17 at 20:13
  • 1
    This does not seem trivial. You want to make sure not to correct false positives such as _bashful_ or _fulsome_, and you need to handle cases where "ful" may be the first or last word in a sentence, be trailed by a comma or other punctuation, and so on. – Stuart Allen Dec 04 '17 at 20:21
  • 1
    On one hand, it's solvable with dplyr's recode() function; if you are looking for a full-service spellcheck, see the hunspell package – Robert Tan Dec 04 '17 at 21:24

4 Answers

4

Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.

library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
#  [1] "fool" "flu"  "fl"   "fuel" "furl" "foul" "full" "fun"  "fur"  "fut"  "fol"  "fug"  "fum" 

So even in your example, would you want to replace ful with full, or many of the other options here?

The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.

library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "

But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.

a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3, 
                paste("\\b", 
                      hunspell(a3)[[1]], 
                      "\\b", 
                      collapse = "", sep = ""), 
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "

Update

Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.

library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"

Update 2

Addressing your comment: with your new example, the issue is again words showing up inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" matches "thin", "think", "thinking", etc., but bracketing it with \\b anchors the pattern to word boundaries, so \\bthin\\b matches only "thin".
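To make the effect of the word boundary concrete, here is a minimal base-R sketch (not part of the original answer; `words`, `unanchored`, `anchored`, and `fixed_last` are names chosen only for illustration). It uses grepl() and gsub() with perl = TRUE rather than stringr:

```r
words <- c("thin", "think", "thinking")

# Unanchored pattern: matches anywhere inside a word.
unanchored <- grepl("thin", words, perl = TRUE)
unanchored
# [1] TRUE TRUE TRUE

# Anchored with \\b on both sides: matches only the whole word.
anchored <- grepl("\\bthin\\b", words, perl = TRUE)
anchored
# [1]  TRUE FALSE FALSE

# A comma or the end of the string also counts as a boundary,
# so only the stand-alone "thi" is replaced.
fixed_last <- gsub("\\bthi\\b", "this", " thin, thic, thi", perl = TRUE)
fixed_last
# [1] " thin, thic, this"
```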

Your example:

a <- c(" thin, thic, thi") 
badwords.corpus <- c("thin", "thic", "thi" ) 
goodwords.corpus <- c("think", "thick", "this")

The solution is to wrap each entry of badwords.corpus in word boundaries:

badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"

Then create vect.corpus as described in the previous update and use it in str_replace_all.

vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus

str_replace_all(a, vect.corpus)
# [1] " think, thick, this" 
Eric Watt
  • Thanks Eric. I already have my dictionary: one vector with the misspelled words and another with the correct spellings. What I need is a function that runs through the string, checks which words are wrong according to my dictionary, and makes the replacement – Max TC Dec 04 '17 at 21:32
  • I tried with **stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)** but it does not make any change – Max TC Dec 04 '17 at 21:33
  • Hi @MaxTC, I updated the answer given you already have a dictionary of words. Does this work for you? – Eric Watt Dec 04 '17 at 21:49
  • It works perfectly at first, @EricWatt. The issue is that if it then finds another similar pattern, it replaces again and changes the already-corrected word; that is, it matches part of the string and makes the substitution again – Max TC Dec 07 '17 at 21:02
  • I'm not sure I follow. Can you give an example? – Eric Watt Dec 07 '17 at 21:42
  • `a <- c(" thin, thic, thi"); badwords.corpus <- c("thin", "thic", "thi"); goodwords.corpus <- c("think", "thick", "this")` In the first entry it replaces **thin** with **think**, but then it finds **thi** and the result is **thickk**; it replaces again and modifies the word – Max TC Dec 07 '17 at 22:21
  • what I'm looking for is that if a string is already correct, it will no longer be modified if it finds a similar pattern – Max TC Dec 08 '17 at 19:24
  • @MaxTC try the solution in Update 2, which addresses the example you gave. – Eric Watt Dec 09 '17 at 01:38
0

I think the function you are looking for is gsub():

gsub(pattern = "ful", replacement = trimws(a2), x = a1)  # trimws() drops the leading space in a2
tobiaspk1
  • I tried it. The issue is that the list of words or synonyms is very large and lives in a data frame. I imagined it would be possible to compare each word against the sentences and make the replacement whenever a similar word is found – Max TC Dec 04 '17 at 20:03
  • One should also take whole words into account: replace ful, but not thankful. – Heikki Dec 04 '17 at 20:33
0

Create a list of the corrections, then apply them using gsubfn, which is a generalization of gsub that can also take a list, function, or proto object as the replacement. The regular expression matches a word boundary, one or more word characters, and another word boundary. Each time it finds a match, it looks up the match in the list names and, if found, replaces it with the corresponding list value.

library(gsubfn)

L <- list(ful = "full")  # can add more words to this list if desired

gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
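
For readers without gsubfn, the same look-up-each-word idea can be sketched in base R with gregexpr() and regmatches(). This is an illustration under my own assumptions, not the answer's method: `fixes`, `tokens`, and `hit` are hypothetical names, and the sample sentence adds "thankful" to show that whole words are compared.

```r
a1 <- " the thankful classroom is ful "
fixes <- c(ful = "full", classroome = "classroom")  # hypothetical dictionary

# Locate every run of word characters in the string.
m <- gregexpr("\\w+", a1, perl = TRUE)
tokens <- regmatches(a1, m)[[1]]

# Swap only tokens that exactly equal a dictionary name; everything else is kept.
hit <- tokens %in% names(fixes)
tokens[hit] <- unname(fixes[tokens[hit]])

# Write the tokens back into their original positions.
regmatches(a1, m) <- list(tokens)
a1
# [1] " the thankful classroom is full "
```

Because each word is looked up once in a single pass, a corrected word is never matched again by a later dictionary entry.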
G. Grothendieck
0

For a kind of ordered replacement, you can try this:

a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")

qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
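
To see what "ordered" means in practice, here is a base-R sketch of sequential fixed replacement, where each pattern is applied in turn to the result of the previous one. `replace_in_order` is a hypothetical helper written for illustration; it is my reading of the in-turn behaviour and is not taken from qdap or stringi.

```r
# Apply each literal (non-regex) replacement in turn, feeding the result forward.
replace_in_order <- function(x, bad, good) {
  for (i in seq_along(bad)) {
    x <- gsub(bad[i], good[i], x, fixed = TRUE)
  }
  x
}

res1 <- replace_in_order("the classroome is ful",
                         c("ful", "classroome"),
                         c("full", "classroom"))
res1
# [1] "the classroom is full"

# When one bad word is a prefix of an already-corrected word, later
# replacements damage earlier ones.
res2 <- replace_in_order("thin, thic, thi",
                         c("thin", "thic", "thi"),
                         c("think", "thick", "this"))
res2
# [1] "thisnk, thisck, this"
```

The second call shows why substring patterns need word boundaries or a per-word lookup when entries overlap.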

For unordered replacement you can use approximate string matching (see stringdist::amatch). Here is an example:

a1 <- c("the classroome is ful")
a1
# [1] "the classroome is ful"

library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
# [1] "the classroom is full"
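
If stringdist is not available, a similar nearest-word lookup can be sketched with utils::adist(), which ships with R. This is an assumption-laden alternative, not the answer's code: adist() computes Levenshtein distance, which may differ from the metric amatch uses, and `words`, `best`, `near`, and `corrected` are illustrative names.

```r
a1 <- "the classroome is ful"
goodwords.corpus <- c("full", "classroom")

words <- strsplit(a1, " ", fixed = TRUE)[[1]]

# Distance matrix: one row per word in the sentence, one column per good word.
d <- utils::adist(words, goodwords.corpus)

# Closest good word for each sentence word, kept only if within distance 1.
best <- apply(d, 1, which.min)
near <- d[cbind(seq_along(words), best)] <= 1
words[near] <- goodwords.corpus[best[near]]

corrected <- paste(words, collapse = " ")
corrected
# [1] "the classroom is full"
```

As with maxDist in amatch, the threshold of 1 is a judgment call: a larger value catches more typos but also risks rewriting short, correct words that happen to sit close to a dictionary entry.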
nghauran