I am not sure if you need clustering methods.
What you do need is some distance measure between strings (e.g. Levenshtein). You most likely would also need some sort of lexicon. But this had the same downsides as the approach I suggest below. If you had some correct words which had the same distance to a misspelt word, e.g. your lexicon contains "car", "cart", "card" and you would have the misspelt word "carx" you could not decide which is the right one. You would need context information (a car is more expensive than a card e.g.).
I will use frequencies of words. I think if you want to show that your problem is generally solvable this should suffice. My idea is that a misspelt word occurs rarely. A rarely occuring correct word could cause problems in my approach but only if other correct words with a small distance to it are present.
I will use the adist
function in the stringdist
package.
require (stringdist)
My example is
words <- c("monitor", "laptop", "mouse", "mouse", "keybsard", "monitor",
"mous", "keyboard", "keyboard", "monitor", "keybxard", "monitor",
"motse", "monitod", "laptop", "keyboard", "laptop", "mousq",
"laptop", "mobitor", "keybolrd", "monitor", "mouse", "laptop",
"monitor", "moute", "mouwe", "mwuse", "tonitor", "ltptop", "keybovrd",
"monitor", "laptop", "moase", "keyboard", "keyboard", "keywoard",
"laptnp", "laptop", "laptop")
Lets look at the frequencies:
freq_table <- as.data.frame(table(words), stringsAsFactors = F)
It looks like this:
words Freq
1 keyboard 5
2 keybolrd 1
3 keybovrd 1
4 keybsard 1
5 keybxard 1
6 keywoard 1
7 laptnp 1
8 laptop 8
9 ltptop 1
10 moase 1
11 mobitor 1
12 monitod 1
13 monitor 7
14 motse 1
15 mous 1
16 mouse 3
17 mousq 1
18 moute 1
19 mouwe 1
20 mwuse 1
21 tonitor 1
Now I separate in 'good' and 'bad' words:
s <- split(freq_table, freq_table$Freq < 3)
good <- s[['FALSE']]
good <- good[order(-good$Freq),]$words
bad <- s[['TRUE']]$words
What I have done is splitted the frequency table in entries which occur 3 or more times and entries that occur less than 3 times. I will explain later why I sorted the good ones. Now we have good:
[1] "laptop" "monitor" "keyboard" "mouse"
and bad:
[1] "keybolrd" "keybovrd" "keybsard" "keybxard" "keywoard" "laptnp" "ltptop" "moase"
[9] "mobitor" "monitod" "motse" "mous" "mousq" "moute" "mouwe" "mwuse"
[17] "tonitor"
Now I calculate the distance matrix between the good and the bad words:
dis <- adist(bad,good)
and look if there are good words in the 'neighbourhood' of bad words.
hits <- sapply(1:NROW(dis), function (i) which(dis[i,] < 3)[1])
I always take the first hit. Since we sorted the good words before, the first hit would be the most frequent word amongst the hits. In that way I want to avoid that a word which is misspelt, but often in the same way, is used as a correct word. This will not work always, its a heuristic.
Now I generate some sort of look up table df
:
bad_corr <- bad
ind <- which(!is.na(hits))
bad_corr[ind] <- good[hits[ind]]
df <- data.frame(bad, bad_corr, stringsAsFactors = F)
It looks like:
bad bad_corr
1 keybolrd keyboard
2 keybovrd keyboard
3 keybsard keyboard
4 keybxard keyboard
5 keywoard keyboard
6 laptnp laptop
7 ltptop laptop
8 moase mouse
9 mobitor monitor
10 monitod monitor
11 motse mouse
12 mous mouse
13 mousq mouse
14 moute mouse
15 mouwe mouse
16 mwuse mouse
17 tonitor monitor
This I use to replace the misspelt words. Summarized, the whole function is
this:
correct <- function (words, minfreq = 3, sensitivity = 3) {
freq_table <- as.data.frame(table(words), stringsAsFactors = F)
s <- split(freq_table, freq_table$Freq < minfreq)
good <- s[['FALSE']]
good <- good[order(-good$Freq),]$words
bad <- s[['TRUE']]$words
dis <- adist(bad,good)
hits <- sapply(1:NROW(dis), function (i) which(dis[i,] < sensitivity)[1])
bad_corr <- bad
ind <- which(!is.na(hits))
bad_corr[ind] <- good[hits[ind]]
df <- data.frame(bad, bad_corr, stringsAsFactors = F)
ind <- match(words, df$bad)
words[which(!is.na(ind))] <- df$bad_corr[ind[!is.na(ind)]]
words
}
sensitivity
says, how 'far away' a misspelt word is allowed to be. minfreq
means that every word that occurs less than minfreq
times is seen as possibly misspelt (but will only be replaced if there is a more frequent word with string distance smaller than sensitivity
). However this function is not perfect. If you dont have misspelt words at all for example it will cause an error. So if you want to use it you should refine it further.
The result of all this:
words correct.words.
1 monitor monitor
2 laptop laptop
3 mouse mouse
4 mouse mouse
5 keybsard keyboard
6 monitor monitor
7 mous mouse
8 keyboard keyboard
9 keyboard keyboard
10 monitor monitor
11 keybxard keyboard
12 monitor monitor
13 motse mouse
.. ...... ........
Finally if you had a taxonomy, you would omit the frequency part and store the words of the taxonomy in good
.
Good Luck!