1

I've found variants on this issue but can't get the suggested solutions working in my situation. I'm pretty new to R with no other coding experience so it may be I'm just missing something basic. Thanks for any help!!

I have a data table with a column of names of organisations, call it Orgs$OrgName. Sometimes there are misspellings of words within the strings that make up the organisation names. I have a look-up table (imported from csv with common misspellings in one column (spelling$misspelt) and their corrections in another column (spelling$correct).

I want to find any parts of OrgName strings which match spelling$misspelt and replace just those parts with spelling$correct.

I have tried various solutions based on mgsub, stri_replace_all_fixed, str_replace_all (replacement of words in strings has been my main reference). But nothing has worked and all the examples appear to be based on manually created vectors using vect1 <- c("item1", "item2", item3") rather than based on a lookup table.

Example of my data:

                                         OrgName
1:                         WAIROA DISTRICT COUNCIL
2:                         MANUTAI MARAE COMMITTEE
3:                                C S AUTOTECH LTD
4:                  NEW ZEALAND INSTITUTE OF SPORT
5:                                 BRAUHAUS FRINGS
6:   CHRISTCHURCH YOUNG MENS CHRISTIAN ASSOCIATION

The lookup table:

    mispelt         correct 
1 ABANDONNED       ABANDONED            
2  ABERATION      ABERRATION            
3  ABILITYES       ABILITIES            
4   ABILTIES       ABILITIES            
5     ABILTY         ABILITY            
6    ABONDON         ABANDON

(There's no misspellings in the first few lines of org names but there's 57000+ more in the dataset)

UPDATE: Here's what I have tried based on the update to the second response (trying that first as it's simpler). It hasn't worked, but hopefully someone can see where it's gone wrong?

library("stringi")
Orgs <- data.frame(OrgNameClean$OrgNameClean)
head(Orgs)
head(OrgNameClean)

write.csv(spelling$mispelt,file = "wrong.csv")
write.csv(spelling$correctspelling,file = "corrected.csv")

patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
head(patterns)
head(replacements)

for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  #print(stri_replace_all_fixed(row, patterns, replacements, 
vectorize_all=FALSE))
  row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% 
"\\b", replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}

head(Orgs)
Orgsdt <- data.table(Orgs)
head(Orgsdt)
chckspellchk <- Orgsdt[OrgNameClean.OrgNameClean %like% "ENVIORNMENT",,] 
##should return no rows if spelling correction worked
head(chckspellchk)

#OrgNameClean.OrgNameClean
#1:   SMART ENVIORNMENTAL LTD

UPDATE 2: more information - there are spaces in the spelling lookup if that makes a difference:

> head(spelling[mispelt %like% " ",,])
     mispelt correctspelling 
1: COCA COLA            COCA            
2:   TORTISE        TORTOISE      

> head(spelling[correctspelling %like% " "])
    mispelt correctspelling  
1:   ABOUTA         ABOUT A             
2:  ABOUTIT        ABOUT IT             
3: ABOUTTHE       ABOUT THE             
4:     ALOT           A LOT       
5: ANYOTHER       ANY OTHER             
6:    ASFAR          AS FAR 
SCHNZ
  • 11
  • 1
  • 3

3 Answers3

1

We can use stringi's stri_replace_*_all() to do multiple replacements on a whole string.

library("stringi")
string <- "WAIROA ABANDONNED COUNCIL','C S AUTOTECH LTD', 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES"
mistake <- c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW')
corrected <- c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD')

stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)    
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)

Output:

[1] "WAIROA ABANDONED COUNCIL','C S AUTOTECH SGM', 'OLD ZEALAND INSTITUTE OF ABERRATION ABILITIES"

Some notes:

  • stri_replace_all_fixed replaces occurrences of a fixed pattern matches.

  • stri_replace_all_regex uses a regular expression pattern instead. This allows us to specify word boundaries: \b to avoid substring matches (an alternative to \bword\b is (?<=\W)word(?=\W)).

  • vectorize_all is set to FALSE, otherwise each replacement is applied to a new copy of the original sentence. See details here.

Full sample:

library("stringi")
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL', 
                                 ' SMART ENVIORNMENTAL LTD',
                                 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
                   stringsAsFactors = FALSE)

patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")

for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  row <- stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE)
  #row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}

PS: I've made a separate CSV with a single headerless column for each character vector. But there are many other ways to read a CSV with R and convert the columns to a char vector.

PS2: If you want substring matches, eg. match ENVIORNMENT in ENVIORNMENTAL do not use stri_replace_all_regex() along with word boundaries \b. This is a great tutorial to buff-up your regex skills.

wp78de
  • 18,207
  • 7
  • 43
  • 71
  • Thanks! Before I try that, the bits I am missing are 1. how to create 'mistake' and 'corrected' from the two columns my lookup table (i.e. not by using c("","",""...)), 2. and will it work if I just reassign the Orgs data table as a new object, 'string'? – SCHNZ Jun 09 '18 at 02:38
  • Thanks! I've tried that but still no joy, see update to main question. – SCHNZ Jun 10 '18 at 23:55
  • The thing is you want substring matches (see PS2). This is why I have added also a line that uses `stri_replace_all_fixed()` – wp78de Jun 11 '18 at 07:10
  • @wpd78e Thanks! I tried it with the stri_replace_all_fixed line of your code and it still hasn't replaced ENVIORNMENT. Both ENVIORNMENT and ENVIORNMENTAL are in the spelling lookup. – SCHNZ Jun 11 '18 at 21:25
  • Have you tried my updated sample? Does it work? My guess is that you have confused the patterns/replacement CSVs or that the columns of these files contain extra whitespace that inhibits the process. – wp78de Jun 11 '18 at 21:41
  • @wpde78e I repeated exactly what's shown in my question update, but used `row <- stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE)` not `row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)`. Didn't work (code ran but left misspellings as is). So the patterns/replacement csvs are correct, so the issue is possibly the latter. I can upload the spelling.csv table but can't see where/how to do that? – SCHNZ Jun 11 '18 at 22:26
  • 1
    You cannot upload it on SO since the site is not intended as a personal coding service. – wp78de Jun 11 '18 at 22:32
0

This answer is potentially too complicated for a new programmer, and I may be writing this more like Python than R (I'm getting a bit rusty on the latter)* but I have a proposed solution for your problem, which isn't trivial by the way. The issues I foresee you having with other solutions you looked at is that they individually only address one small part of the larger puzzle, which is that you need to be able to check every word inside every string against your lookup table. The simplest way I see to do this is to write a number of small functions to do what you need and then use R's family of apply functions to loop through entries and use the functions.

The only other tricky thing here is using an R environment as your lookup table. For whatever reason in R people don't seem to talk much about or really use hash tables (the real name for a lookup table) but they are very common in other languages. Luckily R's environments are actually just an implementation of a C hash table, which is good because hashes are very fast and allow you to directly map one value to another. (More on this here, if interested.)

*I welcome comments or edits from others that would make my answer simpler or more R-idiomatic

# Some example data - note stringsAsFactors=FALSE is critical for this to work
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL', 
                                 'C S AUTOTECH LTD', 
                                 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
                   stringsAsFactors = FALSE)

spelling_df <- data.frame("Mistake" = c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW'),
                          "Correct"= c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD'),
                       stringsAsFactors = FALSE)


# Function to convert your data frame to a hash table
create_hash <- function(in_df){
  hash_table <- new.env(hash=TRUE)
  for(i in seq(nrow(in_df)))
  {
    hash_table[[in_df[i, 1]]] <- in_df[i, 2]
  }
  return(hash_table)
}

# Make the hash table out of your data frame
spelling_hash <- create_hash(spelling_df)

# Try it out:
print(spelling_hash[['ABANDONNED']])  # prints ABANDONED

# Now make a function to apply the lookup - and ensure
# if the string is not in the lookup table, you return the 
# original string instead (instead of NULL)
apply_hash <- function(in_string, hash_table=spelling_hash){
  x = hash_table[[in_string]]
  if(!is.null(x)){
    return(x)
  }
  else{
    return(in_string)
  }
}

# Finally make a function to break the full company name apart, 
# apply the lookup on each word, and then paste it back together
correct_spelling <- function(bad_string) {
  split_string <- strsplit(as.character(bad_string), " ")
  new_split <- lapply(split_string[[1]], apply_hash)
  return(paste(new_split, collapse=' '))
}

# Make a new field that applies the spelling correction
Orgs$Corrected <- sapply(Orgs$OrgName, correct_spelling)
HFBrowning
  • 2,196
  • 3
  • 23
  • 42
  • Also I'm going laugh really hard and then delete my answer if someone suggests a 1-liner from the `tidyverse` that does the same thing. (But I would welcome this) – HFBrowning Jun 07 '18 at 23:04
  • Thanks @HFBrowning, I've given this a go. All works fine up until the last line, which produces this error: Error in strsplit(bad_string, " ") : non-character argument. – SCHNZ Jun 11 '18 at 21:33
  • @SCHNZ I've modified my code to coerce all row entries to character first, try it again and see if it works now – HFBrowning Jun 11 '18 at 22:28
  • Thanks @HFBrowning. Now when running the last line it returns this error: Error in FUN(X[[i]], ...) : attempt to use zero-length variable name. – SCHNZ Jun 11 '18 at 23:34
  • The errors you are getting sound like they are due to improperly formatted inputs. For instance see: https://stackoverflow.com/questions/31385976/error-attempt-to-use-zero-length-variable-name#34167288 which suggests you may have some col names in your data frame that are bad. It's exceptionally difficult to troubleshoot all errors you may have without a reproducible example; and a lot of programming is learning how to fix bugs/search for errors on your own ;) – HFBrowning Jun 11 '18 at 23:55
  • And given the comment by @wp78de it sounds like you are having similar errors with their solution – HFBrowning Jun 11 '18 at 23:59
  • I have ensured that I am reproducing the code in the responses including the column names used in the responses, so it can't be the column names. Potentially something in the data underneath the column names, yes. – SCHNZ Jun 12 '18 at 00:10
0

I came across a similar issue and might have a tidyverse-style solution.

stringr::str_replace_all should let us do multiple replacements using a named vector.

With the lookup data frame of misspelled and corrected values we could turn that into a named vector. Then we could use that named vector as a lookup in str_replace_all.

Here is an example using some of the misspelled and corrected values provided previously.

library(tidyverse)

# load data frame of misspelled and corrected values
foo <- read_csv("mispelt, correct
                ABANDONNED, ABANDONED
                ABERATION, ABERRATION
                ABILITYES, ABILITIES
                ABILTIES, ABILITIES
                ABILTY, ABILITY
                ABONDON, ABANDON
                COCA COLA, COCA            
                TORTISE, TORTOISE
                ABOUTA, ABOUT A             
                ABOUTIT, ABOUT IT             
                ABOUTTHE, ABOUT THE             
                ALOT, A LOT       
                ANYOTHER, ANY OTHER             
                ASFAR, AS FAR", 
                col_types = "c")


# str_replace_all requires a named vector of replacements
# the value of the vector is the correction,
#   while the name of each value is the search string to replace
lookup <- foo$correct
names(lookup) <- foo$mispelt

# data frame to test our lookup named vector
tbl <- tibble(old = foo$mispelt)

# mutating to a new column to show replacement works, 
#   but we could just overwrite the old column as well using mutate
mutate(tbl, new = str_replace_all(old, lookup))

I did not deal with upper or lower case considerations as I'm just demonstrating the named vector usage in str_replace_all and the examples were all upper case. However, regular expressions and/or the regex function could probably help with that if necessary.

Session info:

|package   |loadedversion |
|:---------|:-------------|
|dplyr     |1.0.7         |
|readr     |2.0.0         |
|stringr   |1.4.0         |
|tibble    |3.1.3         |
|tidyverse |1.3.1         |