I'm writing a desktop UI (.Net WinForms) to assist a photographer clean up his image meta data. There is a list of 66k+ phrases. Can anyone suggest a good open source/free .NET component I can use that employs some sort of algorithm to identify potential candiates for consolidation? For example there may be two or more entries which are actually the same word or phrase that only differ by whitespace or punctuation or even slight mis-spelling. The application will ultimately rely on the user to action the consolidation of phrases but having an effective way to automatically find potential candidates will prove invaluable.
-
See here for more information on fuzzy text matching: http://stackoverflow.com/questions/5859561/getting-the-closest-string-match – jordanhill123 Jul 13 '14 at 00:42
2 Answers
Let me introduce you to the Levenshtein distance formula. It is awesome:
http://en.wikipedia.org/wiki/Levenshtein_distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein process, we gave them a confidence rating and allowed them to determine if it was a true duplicate or something unique.

- 38,138
- 7
- 87
- 101
-
I was going to suggest using soundex ([http://www.techrepublic.com/blog/programming-and-development/how-do-i-implement-the-soundex-function-in-c/656](http://www.techrepublic.com/blog/programming-and-development/how-do-i-implement-the-soundex-function-in-c/656)). After applying soundex you could sort your strings by the soundex codes they produce and flag equivalent codes for review by the user. I think the end result might be similar to using levenshtein distance? – hmqcnoesy Nov 22 '11 at 15:55
-
One thing with soundex is that it is useless when checking strings that contain only digits. – jamiebarrow Jul 16 '12 at 15:11
-
Does anyone know of a pure SQL implementation, I'm looking to use SQL Azure which doesn't make use of C# assemblies. – jamiebarrow Jul 16 '12 at 15:12
-
@jamiebarrow A quick google search turned up this: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html – Fosco Jul 16 '12 at 18:37
-
2This is implemented by [fuzzystring](https://github.com/kdjones/fuzzystring), as well as a host of other algorithms. – jpaugh Jun 15 '17 at 19:05
-
Some more c# Levenshtein implementations...http://www.csharpstar.com/csharp-string-distance-algorithm/ – cymorg Apr 12 '18 at 15:16
-
1
Please have a look at https://github.com/JakeBayer/FuzzySharp
It is a c# NuGet package that has multiple methods that implement a certain way of fuzzy search. Not sure, but perhaps Fosco's anwer is also used in one of them.
Edit: I just noticed a comment about this package, but I think it deserves a better place inside this question

- 1,048,767
- 296
- 4,058
- 3,343

- 1,745
- 2
- 22
- 51