Consider the following two strings: `applesauce` and `apple-sauce`. They refer to the same object, so any records containing these two names should be considered duplicates. However, R treats them as separate factor levels. Could one use edit distance, via the stringdist package, to quantify how similar these two names are?

Steven Beaupré

NebulousReveal
- Or you could use the built-in `adist()` function. So it's possible to use edit distance, but that can often get messy. If you just want to ignore non-letter characters such as dashes or other punctuation, then you can use a regular expression to strip those characters out. You need to be much more explicit about what you want to do with your data in order to turn this into a specific programming question. – MrFlick Mar 02 '15 at 02:04
- You might also want to look at tools like [OpenRefine](http://openrefine.org/), which can be pretty handy for resolving such issues. – A5C1D2H2I1M1N2O1R2T1 Mar 02 '15 at 02:16
- You might also look at the RecordLinkage package and the `agrep()` function of base R. For example, `agrep("applesauce", "apple-sauce", ignore.case = TRUE, max.distance = 0.4)`. – lawyeR Mar 02 '15 at 02:44
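For illustration, the two base-R suggestions in the comments above could be sketched roughly as follows. This uses only `adist()` and `agrep()`; the variable names, the `"milk"` entry, and the scaling of the distance into a similarity score are assumptions made for the example, not something the commenters specified.

```r
# Two spellings of the same item, taken from the question
a <- "applesauce"
b <- "apple-sauce"

# adist() returns a matrix of generalized Levenshtein distances;
# only one insertion (the dash) separates the two strings
d <- adist(a, b)[1, 1]

# Assumed scaling: divide by the longer string to get a value in [0, 1]
similarity <- 1 - d / max(nchar(a), nchar(b))

# agrep() does approximate matching of a pattern within each element;
# "milk" is a made-up entry that should not match
candidates <- c("apple-sauce", "apple sauce", "milk")
hits <- agrep("applesauce", candidates,
              ignore.case = TRUE, max.distance = 0.4, value = TRUE)

print(d)
print(similarity)
print(hits)
```

Note that `agrep()` looks for an approximate match anywhere within each element, so a long string that merely contains something close to the pattern would also be returned. If the stringdist package is installed, `stringdist::stringdist(a, b, method = "lv")` should give the same Levenshtein distance as `adist()` here.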
1 Answer
How about this:
`"applesauce" == gsub("-", "", "apple-sauce")`
For cases with several characters to remove, such as `"applesauce" == "apple - sauce"`, you can use the approach from the question "Replace multiple arguments with gsub".
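One way to handle several characters at once, sketched here under the assumption that all whitespace and punctuation should be ignored, is a single `gsub()` call with a character class. The helper name `normalize_name` and the sample vector are hypothetical and only serve the example.

```r
# Hypothetical helper: strip all whitespace and punctuation in one pass
normalize_name <- function(x) {
  gsub("[[:space:][:punct:]]+", "", x)
}

# Sample spellings (illustrative, based on the strings in the question)
raw <- c("applesauce", "apple-sauce", "apple - sauce")
cleaned <- normalize_name(raw)

# After cleaning, all three spellings collapse to a single value
print(unique(cleaned))

# duplicated() flags the later entries as repeats of the first
print(duplicated(cleaned))
```

Applying the same helper to a factor's labels before calling `factor()` would likely merge such variants into one level, though whether stripping all punctuation is appropriate depends on the data.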