1

My table in SQL server has some entries like shown below.

2934046 Kellogg’s Share Your Breakfast  74672   2407522 Kellogg?s Share Your Breakfast  ACTIVE  2015-09-01  9999-12-31

2934046 Kellogg?s Share Your Breakfast  74672   2407522 Kellogg?s Share Your Breakfast  ACTIVE  2015-09-01  9999-12-31

Another example could be

2939508 UOL Ação Social 81534   1527484 UOL Ac?o Social ACTIVE  2015-09-01  9999-12-31
2939508 UOL Ac?o Social 81534   1527484 UOL Ac?o Social ACTIVE  2015-09-01  9999-12-31

As it can be seen that both the entries are same, except for the question mark character in the second entry. Even if I do something like

SELECT DISTINCT * from my_table

it is not useful. I have to figure out a way to remove such kinds of duplicate entries based on special characters. My manager says that the entries with question marks are basically bad data and I should remove them. Does anyone have an idea how to do so ?

Lenin Mishra
  • 181
  • 2
  • 15
  • The real answer to that problem is to avoid corrupting records containing non-ASCII characters. – roeland Sep 12 '16 at 03:59

2 Answers2

0

You can implement damerau-levenshtein algorithm which evaluates how similar two strings are in a clr project and utilize it for t-sql.

You can experiment with your data to find the proper threshold value in order to accept two strings as duplicates.

A c# example of algorithm implementation can be found here:

Damerau - Levenshtein Distance, adding a threshold

Community
  • 1
  • 1
GeorgeT
  • 484
  • 3
  • 5
0

I have face the same problem this year during data load activity but my manager provide me hint to use SSIS fuzzy grouping transformation to find identical records. Please create small SSIS package , add data flow task. Inside data flow task add (source + fuzzy grouping + destination).

Visit - Adding Fuzzy Group Transform to Identify Duplicates https://msdn.microsoft.com/en-us/library/jj819770(v=sql.120).aspx

DatabaseCoder
  • 2,004
  • 2
  • 12
  • 22