1

I have a database that sometimes stores duplicate rows, however the duplicate is not clear cut, e.g. the following two column values would be a duplicate:

G12345 & G1234 --> because they are very similar 
(a string comparison shows that the characters match 83.3%).

I need some help writing an SQL query that would retrieve values that are very similar to a string sent as part of the query, e.g. over 50% of characters matched.

Can someone help with this? I have a C# method as follows but not quite sure how to accomplish this in SQL:

static double StringCompare(string a, string b)
{
  if (a == b) //Same string, no iteration needed.
    return 100;
  if ((a.Length == 0) || (b.Length == 0)) //One is empty, second is not
  {
    return 0;
  }
  var maxLen = a.Length > b.Length ? a.Length : b.Length;
  var minLen = a.Length < b.Length ? a.Length : b.Length;
  var sameCharAtIndex = 0;
  for (var i = 0; i < minLen; i++) //Compare char by char
  {
    if (a[i] == b[i])
    {
      sameCharAtIndex++;
    }
  }
  return sameCharAtIndex / maxLen * 100;
}

Thanks in advance.

Mr. Radical
  • 1,847
  • 1
  • 19
  • 29
Tommy
  • 445
  • 1
  • 6
  • 15

2 Answers2

0

Use Mysql Like Operator instead of doing in service layer.

SELECT * FROM table WHERE column LIKE 'G12___' or 'G12%'.

SELECT * FROM table WHERE column LIKE '%input string as parameter%'.

The "_" wildcard in LIKE predicates means "one of any character," equivalent to "." in regular expressions.

See this for reference.

0

Not sure if your trying to use SQL-Server or MySQL, but you could create and use the following function in SQL-Server:

create function StringCompare
    (@A nvarchar(200),
    @B nvarchar(200)
    )
returns float
as
begin
    if (
        @A = @B
        or (@A is null and @B is null)
        )
    begin
        return 100.0
    end

    if (
        ((@A is null or len(@A) = 0) and (@B is not null and len(@B) > 0))
        or ((@B is null or len(@B) = 0) and (@A is not null and len(@A) > 0))
        )
    begin
        return 0.0
    end

    declare @maxLen int
    set @maxLen = case when len(@A) > len(@B) then len(@A) else len(@B) end

    declare @minLen int
    set @minLen = case when len(@A) < len(@B) then len(@A) else len(@B) end

    declare @sameCharAtIndex int
    set @sameCharAtIndex = 0

    declare @count int
    set @count = 1

    while (@count <= @minLen)
    begin
        if (SUBSTRING(@A, @count, 1) = substring(@B, @count, 1))
        begin
            set @sameCharAtIndex = @sameCharAtIndex + 1
        end

        set @count = @count + 1
    end

    return cast(@sameCharAtIndex as float) / cast(@maxLen as float) * 100.0

end

which could be used in any statement as follows:

select dbo.StringCompare('test', 'test'), dbo.StringCompare('nope', 'test'),  dbo.StringCompare('partial', 'parsomethingelse')

please note, having a loop like this in sql running on many records can be inefficient. And you may want to consider whether you really have to do it in sql.

Contisma
  • 71
  • 6
  • Thanks for this very useful i will try it out. The search would be limited to only a few records. – Tommy Feb 06 '14 at 09:06