I'm working on a project that has the names of various drugs. Often, I will find something like Proscratinol and Proscratinol XR (extended release). I would like to find a query to pick up on all the names of this nature so I can put the 'parent' drug in a table and have these 'child' drugs reference it so when I write a query to do drug counts, I'm not double counting Proscratinol because it has an XR, CR, and whatever else version to it. I wrote the following in order to take a stab at it
;with x
as
(
select drug_name
from rx
group by drug_name
)
select distinct *
from x,x as x2
where LEFT(x2.drug_name,5) = LEFT(x.drug_name,5)
and x.drug_name !=x2.drug_name
This will give me a list of all the drugs whose names share the first five letters. Five is completely arbitrary here. What I've got so far does good enough, but I would like to order the results by descending likeness. So I would like to find their X-most characters reading from the left are the same.
e.g. Phenytoin and Phepil would be 3 (their first three letters are the same)
;with x as ( select drug_name from rx group by drug_name )
select x.drug_name as xDrugName
,x2.drug_name as x2DrugName
,case when LEFT(x2.drug_name,6) = LEFT(x.drug_name,6)
then LEN(left(x.drug_name,6)) else '0' end
from x,x as x2
where LEFT(x2.drug_name,5) = LEFT(x.drug_name,5)
and x.drug_name !=x2.drug_name
group by x.drug_name,x2.drug_name
Instead of hard coding an int into the left function in the above query, I need that integer expression to return how many similar characters the two strings share. Any good way to do this?