3

I am using SOUNDEX & DIFFERENCE functions to do some analysis on the data present in the table.

But this function fails at below type of data. The ITEM TYPE & ITEM SIZE are completely different.

SELECT SOUNDEX('ITEM TYPE'), SOUNDEX('ITEM SIZE')

op:-

I350    I350

For DIFFERENCE op: - 4

I understand every analysis that human mind do can not be coded, still I would like to ask, are there exists any other functions in SQL Server that will help me out on my next level analysis ?

Aditya
  • 2,299
  • 5
  • 32
  • 54
  • `SOUNDEX` wasn't just made up by the SQL Server developers, it's a "well known" [function](https://en.wikipedia.org/wiki/Soundex) – Damien_The_Unbeliever Apr 13 '17 at 10:05
  • You could try looking into the Jaro Winkler algorithm, e.g. http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/ – Richard Hansell Apr 13 '17 at 10:14
  • @Damien_The_Unbeliever, this is not the answer to my concern :( – Aditya Apr 13 '17 at 10:16
  • You will probably find this helpful: https://www.archives.gov/research/census/soundex.html. – Gordon Linoff Apr 13 '17 at 10:42
  • You're using functions to do analysis. That's probably the most generic statement you could make here. What is your concern? That `soundex` (which is essentially a hashing algorithm) returns the same value for two different inputs? Hashes exhibit that behavior. – Ben Thul Apr 15 '17 at 04:12
  • [Soundex](https://en.wikipedia.org/wiki/Soundex) is a terrible algorithm for anything in the real world. It was developed to try and find matches for surnames, but it reduces everything to too small a range. I suspect the only reason it's still around is because Knuth thought it was an interesting algorithm, intellectually. Difference is a bit better, but also not great. – brianary Jan 22 '20 at 17:46

1 Answers1

5

You can use an algorithm, such as Damerau–Levenshtein distance.

The Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

There are T-SQL implementations, such as this one by Steve Hatchett. Alternatively, you can use an implementation in C#, compile a DLL and load it into SQL CLR. Compiled version should be faster.

More info on loading CLR assemblies into SQL @ CLR Assembly C# inside SQL Server.

Community
  • 1
  • 1
Serge
  • 3,986
  • 2
  • 17
  • 37