I've found variants on this issue but can't get the suggested solutions working in my situation. I'm pretty new to R with no other coding experience so it may be I'm just missing something basic. Thanks for any help!!
I have a data table with a column of organisation names, call it Orgs$OrgName. Sometimes words within the organisation names are misspelt. I have a look-up table (imported from a CSV) with common misspellings in one column (spelling$misspelt) and their corrections in another column (spelling$correct).
I want to find any parts of OrgName strings which match spelling$misspelt and replace just those parts with spelling$correct.
I have tried various solutions based on mgsub, stri_replace_all_fixed and str_replace_all (replacement of words in strings has been my main reference), but nothing has worked. All the examples appear to be based on manually created vectors such as vect1 <- c("item1", "item2", "item3") rather than on a lookup table.
Example of my data:
   OrgName
1: WAIROA DISTRICT COUNCIL
2: MANUTAI MARAE COMMITTEE
3: C S AUTOTECH LTD
4: NEW ZEALAND INSTITUTE OF SPORT
5: BRAUHAUS FRINGS
6: CHRISTCHURCH YOUNG MENS CHRISTIAN ASSOCIATION
The lookup table:
  mispelt    correct
1 ABANDONNED ABANDONED
2 ABERATION  ABERRATION
3 ABILITYES  ABILITIES
4 ABILTIES   ABILITIES
5 ABILTY     ABILITY
6 ABONDON    ABANDON
(There are no misspellings in the first few rows of org names, but there are 57,000+ more rows in the dataset.)
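To make the goal concrete, the sketch below shows the intended result on toy data using only base R. The names and lookup entries are made up for illustration (they are not rows from the real tables), the column names mispelt and correctspelling are assumed, and the loop treats each misspelling as a regular expression wrapped in word boundaries, so entries containing regex metacharacters would need escaping first. It is a sketch of the desired behaviour, not a tested solution for the full dataset.

```r
# Toy stand-ins for the real tables (hypothetical values)
Orgs <- data.frame(
  OrgName = c("SMART ENVIORNMENTAL LTD", "WAIROA DISTRICT COUNCIL", "ABILTY TRUST"),
  stringsAsFactors = FALSE
)
spelling <- data.frame(
  mispelt         = c("ENVIORNMENTAL", "ABILTY"),
  correctspelling = c("ENVIRONMENTAL", "ABILITY"),
  stringsAsFactors = FALSE
)

# Apply every lookup row in turn to the whole column of names.
# gsub() is vectorised over the names, so only the lookup table is looped over.
fixed <- Orgs$OrgName
for (k in seq_len(nrow(spelling))) {
  # \\b restricts the match to whole words
  pattern <- paste0("\\b", spelling$mispelt[k], "\\b")
  fixed <- gsub(pattern, spelling$correctspelling[k], fixed, perl = TRUE)
}
Orgs$OrgName <- fixed
print(Orgs)
```

Looping over the lookup table rather than over the 57,000+ names keeps the number of iterations equal to the number of misspellings.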
UPDATE: Here's what I have tried based on the update to the second response (trying that first as it's simpler). It hasn't worked, but hopefully someone can see where it's gone wrong?
library("stringi")
library("data.table")  # needed for data.table() and %like% used below
Orgs <- data.frame(OrgNameClean$OrgNameClean)
head(Orgs)
head(OrgNameClean)
write.csv(spelling$mispelt,file = "wrong.csv")
write.csv(spelling$correctspelling,file = "corrected.csv")
patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
head(patterns)
head(replacements)
for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  # Earlier attempt with fixed (literal) matching, left commented out:
  #print(stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE))
  # Whole-word regex match: wrap each pattern in \\b word boundaries
  row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b",
                                replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}
head(Orgs)
Orgsdt <- data.table(Orgs)
head(Orgsdt)
chckspellchk <- Orgsdt[OrgNameClean.OrgNameClean %like% "ENVIORNMENT",,]
##should return no rows if spelling correction worked
head(chckspellchk)
#OrgNameClean.OrgNameClean
#1: SMART ENVIORNMENTAL LTD
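One thing that may be worth checking in the attempt above is what readLines() returns after the write.csv() round trip. By default write.csv() adds a header line and a quoted row-name column, so the lines read back are not the bare words. The sketch below uses a made-up two-row lookup (hypothetical values, with the column names mispelt and correctspelling assumed from the code above) and a temporary file to show the difference, and then takes the two columns directly as character vectors instead.

```r
# Hypothetical two-entry lookup standing in for the real `spelling` table
spelling <- data.frame(
  mispelt         = c("ABILTY", "ABONDON"),
  correctspelling = c("ABILITY", "ABANDON"),
  stringsAsFactors = FALSE
)

# Round trip through a CSV file, as in the attempt above
tmp <- tempfile(fileext = ".csv")
write.csv(spelling$mispelt, file = tmp)
from_file <- readLines(tmp)
print(from_file)  # one more element than the lookup has rows; words are quoted

# Taking the columns directly skips the file and keeps the bare words
patterns     <- as.character(spelling$mispelt)
replacements <- as.character(spelling$correctspelling)
print(patterns)
```

If the vectors printed by head(patterns) and head(replacements) contain quotes, commas or row numbers, the regex built from them cannot match the plain words in the organisation names.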
UPDATE 2: More information: some entries in the spelling lookup contain spaces, in case that makes a difference:
> head(spelling[mispelt %like% " ",,])
mispelt correctspelling
1: COCA COLA COCA
2: TORTISE TORTOISE
> head(spelling[correctspelling %like% " "])
mispelt correctspelling
1: ABOUTA ABOUT A
2: ABOUTIT ABOUT IT
3: ABOUTTHE ABOUT THE
4: ALOT A LOT
5: ANYOTHER ANY OTHER
6: ASFAR AS FAR