Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spellings.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or with one of the many other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
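But, as the other comments and answers have pointed out, you do need to be careful about the misspelled string showing up inside other words, such as the ful in thankful. Wrapping the pattern in \\b restricts the match to the whole word.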
a3 <- c(" the thankful classroom is ful ")
# wrap the flagged word in \\b so that only the whole word is replaced
str_replace_all(a3,
                paste0("\\b", hunspell(a3)[[1]], "\\b", collapse = ""),
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
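If a string contains more than one flagged word, the same idea can be extended by building one named replacement vector. This is only a sketch: it assumes you are happy with the first hunspell suggestion for every word, and the names `text`, `bad`, `first`, `keep` and `fixes` are made up for illustration.

```r
library(hunspell)
library(stringr)

text <- " the classroom is ful "

# words hunspell flags in the string (may be empty)
bad <- unique(hunspell(text)[[1]])

# first suggestion per flagged word; NA when hunspell offers none
first <- vapply(hunspell_suggest(bad),
                function(s) if (length(s) > 0) s[1] else NA_character_,
                character(1))

# only replace words that actually have a suggestion
keep <- !is.na(first)
if (any(keep)) {
  # names are whole-word patterns, values are the replacements
  fixes <- setNames(first[keep], paste0("\\b", bad[keep], "\\b"))
  text <- str_replace_all(text, fixes)
}
text
```

Which suggestion comes first depends on the dictionary in use, so check the result before trusting it on real data.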
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment: with your new example, the issue is again words showing up inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" will match "thin", but also the "thin" inside "think", "thinking", etc. If you bracket the pattern with \\b, it is anchored to word boundaries, so \\bthin\\b will only match "thin" as a whole word.
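A quick way to check this for yourself is with base R's grepl, which needs no extra packages. The vector `words` is just an illustrative name.

```r
words <- c("thin", "think", "thinking")

# unanchored: matches anywhere inside the word
plain <- grepl("thin", words)
plain
# [1] TRUE TRUE TRUE

# anchored with \\b on both sides: whole word only
bounded <- grepl("\\bthin\\b", words)
bounded
# [1]  TRUE FALSE FALSE
```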
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
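To see what goes wrong without the anchors, here is a small sketch that applies the same dictionary as plain patterns. It relies on str_replace_all applying the pairs of a named vector one after another; `unanchored` and `result` are illustrative names.

```r
library(stringr)

a <- " thin, thic, thi"

# names are the (unanchored) patterns, values are the replacements
unanchored <- c(thin = "think", thic = "thick", thi = "this")

result <- str_replace_all(a, unanchored)
result
# [1] " thisnk, thisck, this"
```

The first two pairs do the right thing, but the last pattern "thi" then also matches inside the freshly corrected "think" and "thick".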
The solution is to modify badwords.corpus:
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create vect.corpus as described in the previous update, and use it in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
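If you need this in several places, the two steps can be wrapped in a small function. This is a minimal sketch, not part of stringr: `replace_whole_words` is a hypothetical helper name, and it assumes the bad words contain no regex metacharacters.

```r
library(stringr)

# Hypothetical helper: replace whole-word occurrences of `bad` with `good`
replace_whole_words <- function(text, bad, good) {
  stopifnot(length(bad) == length(good))
  # anchor every bad word to word boundaries
  patterns <- paste0("\\b", bad, "\\b")
  # named vector: names are patterns, values are replacements
  str_replace_all(text, setNames(good, patterns))
}

out1 <- replace_whole_words(" thin, thic, thi",
                            c("thin", "thic", "thi"),
                            c("think", "thick", "this"))
out1
# [1] " think, thick, this"

out2 <- replace_whole_words("I would like a cheseburger and friees please",
                            c("cheseburger", "friees"),
                            c("cheeseburger", "fries"))
out2
# [1] "I would like a cheeseburger and fries please"
```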