Here's a method to count the different words.
But it uses the STRING_SPLIT
function that was only added since Sql Server 2017.
select t.*
from YourTable t
cross apply
(
select
count(nullif(s1.value,' ')) as diffcnt1
,count(nullif(s2.value,' ')) as diffcnt2
--,string_agg(nullif(s1.value,' '),'|') as diff1,
--,string_agg(nullif(s2.value,' '),'|') as diff2
from string_split(t.address1,' ') s1
full join string_split(t.address2,' ') s2
on lower(s1.value) = lower(s2.value)
where s1.value is null
or s2.value is null
) ca
where t.address1 != t.address2
and (ca.diffcnt1>1 or ca.diffcnt2>1)
A test on db<>fiddle here
Or try this quick & dirty method to filter at least some that are relatively similar.
By also checking the DIFFERENCE between the strings.
SELECT *
FROM YourTable
WHERE (LOWER(address1) != LOWER(address2))
AND DIFFERENCE(LOWER(address1), LOWER(address2)) < 4
But DIFFERENCE
just compares the SOUNDEX
values of the strings. So there's a probability of getting false-positives or false-negatives.
Perhaps it's more accurate to check the Levenshtein distance between the strings.
Here's an old SO post about calculating that metric in Sql Server: Levenshtein distance in T-SQL
And there exist other string metrics.