Compare addresses that have unit vs # but are actually the same

Question

I am trying to retrieve records in which address 1 <> address 2.

I ran into a situation where there are addresses that look like this example:

address 1                   address 2
69 Cherokee Cir Unit 204    69 CHEROKEE CIR # 204

My query returns records that fall into this case, but this is incorrect since the addresses are actually the same.

Would it be possible to get an example how this can be addressed?

The address value for both address 1, and address 2 are full blocks, not stored in multiple columns.

Address manipulation is extremely problematic at best. You could use REPLACE in your example here to make this comparison. But in a case sensitive collation these are still not the same. And if they are not **exactly** the same other than Unit and # they won't match. — Sean Lange, Nov 26 '19 at 17:43
There is a nice Wikipeadia article about this problem: https://en.wikipedia.org/wiki/Postal_address_verification — Piotr Palka, Nov 26 '19 at 17:44

score 0 · Answer 1 · answered Nov 26 '19 at 17:47

0

Here is how you could use REPLACE for this. But even an extra space in either address means they won't match.

declare @Something table
(
    address1 varchar(50)
    , address2 varchar(50)
)

insert @Something values
('69 Cherokee Cir Unit 204', '69 CHEROKEE CIR # 204')

select *
from @Something s
where s.address1 = replace(s.address2, '#', 'Unit')

answered Nov 26 '19 at 17:47

Sean Lange

33,028
3
25
40

Thank you Sean, this is a good starting point. I have been struggling to filter correctly the results that I am going crazy. – erasmo carlos Nov 26 '19 at 17:59

LukStorms · Accepted Answer · 2019-11-28T10:00:47.747

Here's a method to count the different words. But it uses the STRING_SPLIT function that was only added since Sql Server 2017.

select t.*
from YourTable t
cross apply 
(
  select
   count(nullif(s1.value,' ')) as diffcnt1
  ,count(nullif(s2.value,' ')) as diffcnt2
  --,string_agg(nullif(s1.value,' '),'|') as diff1,
  --,string_agg(nullif(s2.value,' '),'|') as diff2
  from string_split(t.address1,' ') s1
  full join string_split(t.address2,' ') s2
    on lower(s1.value) = lower(s2.value)
  where s1.value is null
     or s2.value is null
) ca
where t.address1 != t.address2
  and (ca.diffcnt1>1 or ca.diffcnt2>1)

A test on db<>fiddle here

Or try this quick & dirty method to filter at least some that are relatively similar.
By also checking the DIFFERENCE between the strings.

SELECT *
FROM YourTable
WHERE (LOWER(address1) != LOWER(address2)) 
  AND DIFFERENCE(LOWER(address1), LOWER(address2)) < 4

But DIFFERENCE just compares the SOUNDEX values of the strings. So there's a probability of getting false-positives or false-negatives.

Perhaps it's more accurate to check the Levenshtein distance between the strings.
Here's an old SO post about calculating that metric in Sql Server: Levenshtein distance in T-SQL

And there exist other string metrics.

Well, it just calculates a number between 0 and 4 to indicate the degree of similarity. And 4 indicates strongly similar. If he wants to find different strings, but avoid those that are to similar, then filtering those with difference 4 isn't really a perfect solution. But maybe the OP might find this concept still helpfull. — LukStorms, Nov 26 '19 at 19:49
No problemo. A time ago I did the reverse, and read your answer wrong. It happens. ;) — LukStorms, Nov 26 '19 at 20:40

Compare addresses that have unit vs # but are actually the same

2 Answers2