Possible Duplicate:
R: How to measure similarity between strings?
I have been working on a large dataset. I need to find potential duplicates, i.e. similar names such as:
NewYork, new york, New York, Naw York, Niy Work
So I thought the following rule could help catch such potential duplicates:
If any three consecutive characters match, treat the two strings as potential duplicates.
Issue: this would also flag the following words as potential duplicates, although they are not:
fate, late, mate, rate
If I become more conservative and require four consecutive matching characters, then I might have problems with short words.
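To make the rule concrete, here is a minimal sketch of how I imagine it, assuming case is ignored. The helper names `ngrams` and `shares_ngram` are made up for illustration and do not come from any package:

```r
# Hypothetical helper: all runs of n consecutive characters in a string,
# compared case-insensitively (lower-casing is my assumption).
ngrams <- function(x, n = 3) {
  x <- tolower(x)
  if (nchar(x) < n) return(character(0))
  starts <- seq_len(nchar(x) - n + 1)
  substring(x, starts, starts + n - 1)
}

# TRUE when the two strings share at least one run of n consecutive characters.
shares_ngram <- function(a, b, n = 3) {
  length(intersect(ngrams(a, n), ngrams(b, n))) > 0
}

shares_ngram("fate", "late")          # TRUE, both contain "ate" (false positive)
shares_ngram("fate", "late", n = 4)   # FALSE
shares_ngram("Apple", "Aple", n = 4)  # FALSE, a real typo is now missed
```

With n = 3 unrelated words are flagged, and with n = 4 short misspellings slip through, which is exactly the trade-off described above.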
Is there a smart way to find typo-type duplicates?
Consider the following small example:
myfruits <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry",
"Blackberry", "Blackcurrant", "Blueberry", "Currant",
"Cherry", "Cherimoya", "Clementine", "Aple", "Binana", "BlaCkbarry",
"pricot")
The following pairs in the above list are spelling errors that are in fact duplicates:
"Apple" & "Aple",
"Banana" & "Binana",
"Blackberry" & "BlaCkbarry",
"Apricot" & "pricot"