
Possible Duplicate:
R: How to measure similarity between strings?

I have been working on a large dataset and need to find potential duplicates, that is, similar names such as:

NewYork, new york, New York, Naw York, Niy Work 

I thought the following rule could help catch such potential duplicates:

If any three consecutive characters match, flag the pair.

Issue: this would also flag the following as potential duplicates when they really are not: fate, late, mate, rate. If I become more conservative and require four consecutive characters, I might have problems with short words.
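To make the rule concrete, here is a minimal sketch of the "shared run of three characters" check in base R. The helper names `ngrams` and `shares_ngram` are made up for illustration, and lower-casing before comparison is an assumption, not part of the rule as stated.

```r
# Hypothetical helper: all substrings of length n, compared case-insensitively
ngrams <- function(s, n = 3) {
  s <- tolower(s)
  if (nchar(s) < n) return(character(0))
  substring(s, 1:(nchar(s) - n + 1), n:nchar(s))
}

# Hypothetical helper: TRUE if the two strings share at least one n-gram
shares_ngram <- function(a, b, n = 3) {
  length(intersect(ngrams(a, n), ngrams(b, n))) > 0
}

shares_ngram("New York", "Naw York")  # shares " yo", "yor", "ork"
shares_ngram("fate", "late")          # shares "ate", so it is flagged too
shares_ngram("fate", "late", n = 4)   # no shared 4-gram
```

With `n = 3` the fate/late pair is flagged, and with `n = 4` any name shorter than four characters can never match anything, which is the short-word problem described above.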

Is there a smart way to find typo-type duplicates?

Consider the following small example:

myfruits <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry", 
"Blackberry", "Blackcurrant",    "Blueberry", "Currant", 
"Cherry", "Cherimoya", "Clementine", "Aple", "Binana", "BlaCkbarry",
"pricot")

The following pairs in the list above are spelling errors but are in fact duplicates:

"Apple" & "Aple",
"Banana" & "Binana",
"Blackberry" & "BlaCkbarry",
"Apricot" & "pricot"
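
One possible way to surface these pairs is to compare every name against every other by edit distance. The sketch below uses `adist()` from base R (package utils), which computes generalized Levenshtein distances. The threshold `max_dist` and the lower-casing step are assumptions chosen for this example, not a recommended setting.

```r
myfruits <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry",
              "Blackberry", "Blackcurrant", "Blueberry", "Currant",
              "Cherry", "Cherimoya", "Clementine", "Aple", "Binana",
              "BlaCkbarry", "pricot")

# Pairwise edit distances on lower-cased names, so case differences are ignored
d <- adist(tolower(myfruits))

# Assumed threshold: treat names within one edit of each other as candidates
max_dist <- 1

# upper.tri() keeps each unordered pair once and drops the diagonal
close <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)

candidates <- data.frame(first    = myfruits[close[, "row"]],
                         second   = myfruits[close[, "col"]],
                         distance = d[close],
                         stringsAsFactors = FALSE)
candidates
```

On this list the four pairs above are the only ones within one edit. A fixed threshold still has the short-word problem, though: "fate" and "late" are also one edit apart, so on real data the cutoff may need to scale with string length, and the result is best treated as a list of candidates for manual review.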
fprd
  • maybe `?agrep` is useful to you. – johannes Jul 05 '12 at 11:01
  • If you need to do a lot of this maybe check out [Google Refine](http://code.google.com/p/google-refine/) which is designed to clean up messy data. – mindless.panda Jul 05 '12 at 11:01
  • I asked a very similar question a year ago and got a very nice and short answer using `agrep`: http://stackoverflow.com/q/6044112/567015 – Sacha Epskamp Jul 05 '12 at 11:16
  • This really is not an `R` question. It is, and has been for a long time, a very difficult problem to solve properly. Just take a look at the duplicate junk mail you get at home :-( . Or, consider that "APLE" is a very common initialization, so unless your software knows in advance that you're looking for fruit, it can't know whether to change "APLE" or not. For that matter, maybe it should become "ape." In short, you have jumped feet first into a major slime pit! – Carl Witthoft Jul 05 '12 at 11:43
