Find all strings that share at least X characters, order by likeness

Question

I'm working on a project that has the names of various drugs. Often, I will find something like Proscratinol and Proscratinol XR (extended release). I would like to find a query to pick up on all the names of this nature so I can put the 'parent' drug in a table and have these 'child' drugs reference it so when I write a query to do drug counts, I'm not double counting Proscratinol because it has an XR, CR, and whatever else version to it. I wrote the following in order to take a stab at it

;with x
as
(
select   drug_name
    from rx
    group by drug_name
)

select distinct *
    from x,x as x2
    where LEFT(x2.drug_name,5) = LEFT(x.drug_name,5)
    and x.drug_name !=x2.drug_name

This will give me a list of all the drugs whose names share the first five letters. Five is completely arbitrary here. What I've got so far does good enough, but I would like to order the results by descending likeness. So I would like to find their X-most characters reading from the left are the same.

e.g. Phenytoin and Phepil would be 3 (their first three letters are the same)

;with x as ( select drug_name from rx group by drug_name )

select   x.drug_name as xDrugName
        ,x2.drug_name as x2DrugName
        ,case when LEFT(x2.drug_name,6) = LEFT(x.drug_name,6) 
            then LEN(left(x.drug_name,6)) else '0' end
    from x,x as x2
    where LEFT(x2.drug_name,5) = LEFT(x.drug_name,5)
    and x.drug_name !=x2.drug_name 
    group by x.drug_name,x2.drug_name

Instead of hard coding an int into the left function in the above query, I need that integer expression to return how many similar characters the two strings share. Any good way to do this?

I can't answer your question, but please use explicit JOINs in your code. Implicit JOINs are being deprecated and can have unwanted results (such as unintended CROSS JOINs) — Melanie, Mar 20 '13 at 15:18
Appreciate the sentiment, but in this case I used a cross join because it was faster and the intent isn't nebulous. — wootscootinboogie, Mar 20 '13 at 15:20
You might try looking into fulltext indexing & queries. I don't know enough about it to say for sure but I think it server your purpose better than what you are asking. — Kenneth Fisher, Mar 20 '13 at 15:22

Gordon Linoff · Accepted Answer · 2013-03-20T21:44:14.470

2

This approach uses a number generator and then just tests the length of overlap:

select x.drug_name, x2.drug_name, MAX(c.seqnum) as OverlapLen
from x cross join
     x x2 cross join
     (select ROW_NUMBER() over (order by (select NULL)) seqnum
      from INFORMATION_SCHEMA.COLUMNS c
     ) c
where LEFT(x.drug_name, c.seqnum) = LEFT(x2.drug_name, c.seqnum) and
      len(x.drug_name) >= c.seqnum and len(x2.drug_name) >= c.seqnum
group by x.drug_name, x.drug_name
order by x.drug_name, OverlapLen desc

This assumes that information_schema.columns has enough rows for the longer drug names.

This joins x to itself and then joins in a list of numbers. The where clause is checking three conditions: (1) that the left part of each drug name is the same up to seqnum; (2) that the length of each drug name is less than or equal to seqnum.

The aggregation then takes each pair and chooses the highest value of seqnum -- this should be the longest substring match.

edited Mar 20 '13 at 21:44

answered Mar 20 '13 at 15:40

Gordon Linoff

1,242,037
58
646
786

I had a chance to test this and it doesn't work, every seqnum was the same. Could you expound on what you were trying to do here? – wootscootinboogie Mar 20 '13 at 21:38
@wootscootinboogie . . . I had the length comparison backwards. It was `<=` but it should be `>=`. – Gordon Linoff Mar 20 '13 at 21:44
I asked this same question some place else and this solution was by FAR the best. bravo :) you deserve more than one upvote – wootscootinboogie Mar 21 '13 at 12:58

score 0 · Answer 2 · answered Jul 01 '13 at 19:50

you want longest common sequence. here is a SQL server implementation:

select dbo.lcs(@string1, @string2), len(@string1), len(@string2)

CREATE FUNCTION [dbo].[LCS]( @s varchar(MAX), @t varchar(MAX) )
RETURNS INT AS
BEGIN
  DECLARE @d varchar(MAX), @LD INT, @m INT, @n INT, @i INT, @j INT, 
    @s_i NCHAR(1), @t_j NCHAR(1)

  SET @n = LEN(@s)
  IF @n = 0 RETURN 0

  SET @m = LEN(@t)
  IF @m = 0 RETURN 0

  SET @d = REPLICATE(CHAR(0),(@n+1)*(@m+1))

  SET @i = 1
  WHILE @i <= @n BEGIN
    SET @s_i = SUBSTRING(@s,@i,1)

    SET @j = 1
    WHILE @j <= @m BEGIN
      SET @t_j = SUBSTRING(@t,@j,1)

      IF @s_i = @t_j

        SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,
          NCHAR(UNICODE(
            SUBSTRING(@d, (@j-1)*(@n+1)+@i-1+1, 1)
            )+1))

      ELSE

        SET @d = STUFF(@d,@j*(@n+1)+@i+1,1,CHAR(dbo.Max2(
          UNICODE(SUBSTRING(@d,@j*(@n+1)+@i-1+1,1)),
          UNICODE(SUBSTRING(@d,(@j-1)*(@n+1)+@i+1,1)))))

      SET @j = @j+1
    END
    SET @i = @i+1
  END      

  SET @LD = UNICODE(SUBSTRING(@d,@n*(@m+1)+@m+1,1))
  RETURN @LD
END

Find all strings that share at least X characters, order by likeness

2 Answers2

Linked