4

I have a table where a column has some misspelled strings, lets say as an example:

table$Status returns these values:

"alive" "sic" "alive" "sick" "alive" "si" "alive" "ali"   "alv"  
"dead" "alive" "alive" "alive" "al"    "dead"  "dead"  "de"    "dead" 
"dead"  "dea"   "dead"  "al"   "dead"  "de"    "al"  "de"    "sick" 
"dead"  "alive"

I want to have alive, sick or dead like the following example:

"alive" "sick" "alive" "sick" "alive" "sick" "alive" "alive"   "alive"  
"dead" "alive" "alive" "alive" "alive"    "dead"  "dead"  "dead"    "dead" 
"dead"  "dead"   "dead"  "alive"   "dead"  "dead"    "alive"  "dead"    "sick" 
"dead"  "alive"

I know there is this function from the package RecordLinkage to get the distance between strings like:

levenshteinSim("al", "alive")

So i will be comparing every single value with another and get the best similarities, also I know by using table(Table$Status) I will get the number of the most repeated values and those will be the correct.

But here is my question how can I compare them all with each other and replace my table?? If someone knows an easy way to do it would be really helpful.

loki
  • 9,816
  • 7
  • 56
  • 82
Programmer Man
  • 1,314
  • 1
  • 9
  • 29

1 Answers1

1
library(data.table)
library(dplyr)
table <- data.table(Status=c("alive", "sic", "alive", "sick", "alive", "si",  "de",   "al"  ))
table[,Status2:=ifelse(Status%like%"^al","alive",
                      ifelse(Status%like%"^si","sick","dead"))]

UPDATE

A more general solution:

library(data.table)

table <- data.table(Status=c("alive", "sic", "alive", "sick", "alive", "si",  "de",   "al"  ))

correct_values <- c("alive","sick","dead")
for (i in 1:nrow(table)){ # i <- 2
  string <- table[i,Status]
  max <- 0
  similarity <- 0
  for(j in correct_values){ # j <- "alive"
    similarity <-   length(Reduce(intersect, strsplit(c(string, j), split = "")))
    if(similarity > max){
      max <- similarity
      to_replace <- j
    }
  }
  table[i,"Status"] <- to_replace
}

Here I assume that you know which values are the corrects ones (hereby you input manually the correct_values. This it will substitute the value in column Status with the value from the correct_values that have the highest number of common characters.

I hope it helps !

quant
  • 4,062
  • 5
  • 29
  • 70
  • This works but it's very specific for my example what happens when I have a table with 10000 values how can I know that those are the misspelled words? – Programmer Man Jul 24 '17 at 13:00
  • @quant I would advise using `dplyr::case_when` over nested `ifelse`. @ProgrammerMan If it less specific, there is no way to determine what 'al' means. Is it 'alive' or 'all'? Maybe 'ale'? Ofc you should use fuzzy matching over first symbols, but still you must provide pattern of full words for comparison. – Andrey Kolyadin Jul 24 '17 at 13:10
  • @quant Thank you so much this works!! – Programmer Man Jul 25 '17 at 07:27