0

I'm looking to somehow use soundex to standardize my data because there are sometimes misspellings in a certain column.

For example, I have a column "City" and data will come through as "Denver", "Denver CO" or Denvr".

Is there a way to have all variations of "Denver" automatically correct to "Denver" ?

HaveNoDisplayName
  • 8,291
  • 106
  • 37
  • 47
user2860150
  • 67
  • 1
  • 6
  • Soundex is a horrific matching tool for most text. – billinkc Apr 30 '15 at 18:22
  • You're looking to standardize your data. Consider Hollywood. Hollywood is a city, but it's a vanity city. The preferred city is Los Angeles but the Post Office will deliver to Hollywood. Nothing's going to standardize that. – billinkc Apr 30 '15 at 18:26
  • @billinkc I don't really need it to be smart, it just needs to find the correct match when its a few characters off. – user2860150 Apr 30 '15 at 19:36
  • The native SSIS approach would be to use a Fuzzy Lookup. Under the covers, it's likely to use advanced matching algorithms like Levenstein distance, ngram analysis, etc. Assuming you have a body of known values, Fuzzy Lookup will attempt to match based on your tolerance for precision. It *is* an Enterprise Edition feature though so be aware of that before you attempt to implement. Will work fine in VS because that is licensed for it but will go belly up in Standard and I *think* BI edition – billinkc Apr 30 '15 at 19:52

3 Answers3

2

If you want ALL variations to match a specific phrase, I'd rather go with string-distance functions, than with soundex.

The simplest one is the Levenstein distance and here's an example implementation in T-SQL.

Levenstain distance is a measure of how many edits need to be performed on one string to get the second string. This function is frequently used in so called fuzzy searches where you're interested in getting results similar, and not exactly matching your search criteria.

Also, since you're working with MS SQL Server, I'd consider using Data Quality Services for cleaning up contents of the database.

Community
  • 1
  • 1
npe
  • 15,395
  • 1
  • 56
  • 55
  • Sorry I'm not really sure how this works. I think I need some sort of view to process this data and the output being the corrected "City". Do I create a view with this code in it? – user2860150 Apr 30 '15 at 18:25
  • For every value in your table, you'd need to query other rows for similarity. Then you need to decide which way to clean this up. Levenstein algorithm is just a tool, it's up to you how to use it. If you want to clean up entire database, seriously consider using DQS instead. – npe Apr 30 '15 at 18:27
  • does this function actually make the correction for me? I need something to actually make the correction if that makes sense. – user2860150 Apr 30 '15 at 18:28
  • The function returns the _edit distance_ between two strings. It does not change the data. It's up to you to decide how different the strings can be to still consider them as _"matching"_. You need to run the UPDATE yourself. I guess what you want to do will require quite a lot of coding if done manually. Try learning some more about Data Quality tools. There are solutions for your problem - both open source and commercial. DQS is part of SQL Server already. – npe Apr 30 '15 at 18:32
1

You might look into MAster Data Management. I think there is something in 2008 that you can use in that area. Otherwise you would need to upgrade to use it.

HLGEM
  • 94,695
  • 15
  • 113
  • 186
0

If this is specifically for address information - there are some third party plugins that will do an address lookup - this is potentially a better way to clean your data. I have used used CozyRoc Address lookup with success

CPU_BUSY
  • 781
  • 2
  • 5
  • 15
  • It just needs to be the city column. Literally just need: When City = 'Denvr', set to'Denver' but for many instances of misspellings. – user2860150 Apr 30 '15 at 18:25