2

I have column TreeName with a datatype varchar(30) in table that has similar entries that are entered incorrectly in the table like:

abc]

abc,

-ab c

abc

eucalyptus

eucalyptus;

‘eucalyptus

eucalyptu s

We cannot control the entries, but we need to find similar matches and correct them. What would be the best approach to find possible similar matches (SQL Query) in the same column? The query should return that (abc] abc, -ab c) values are similar to abc. Once we find these matches, we update the table with the correct values. Thanks!

user2536008
  • 215
  • 1
  • 5
  • 15

1 Answers1

2

In case you don't have a list of legal entries, there is no EXACT way to do this, but you can implement good heuristics.

First, in case non alphabet chars are irrelevant, you can remove them. From your samples, it seems that you are not looking for spelling mistakes, just random insertions of non alphabet chars. Then, writing the function that removes illegal chars and the a query that groups by the result of that function applied to your values will do the job

If you want to compensate for more complex spelling mistakes, you need much more complex methods.

Fuzzy grouping in SSIS can be used to find rows that belong to the same group. The rest is up to you. It is similar to strict grouping (the one we do in SQL), but tolerates small differences in values.

If you can create a list of legal entries, Levenshtein Distance which Can Be Implemented in SQL can be used to find rows that are close to the legal entries.

Community
  • 1
  • 1
Alireza
  • 5,421
  • 5
  • 34
  • 67
  • 1
    SOUNDEX may possibly help here as well if you a list of legal entries. You might be able to group on SOUNDEX output to help identify similar records. Although fuzzy grouping sounds like it could possibly take care of that. – Mark Sponsler Sep 24 '15 at 20:55
  • We cannot generate a list of legal entries. Also, we cannot determine illegal chars as the column might have entries like eucalyptus (1.0) and eucalyptus(1.0) without the space. We need to find that these are two possible matches. Once we identify them, we consult the geologist to determine the correct name and we update the database based on that. – user2536008 Sep 24 '15 at 21:14
  • So fuzzy grouping ismmy best bet – Alireza Sep 24 '15 at 21:16