1

I am trying to retrieve records in which address 1 <> address 2.

I ran into a situation where there are addresses that look like this example:

address 1                   address 2
69 Cherokee Cir Unit 204    69 CHEROKEE CIR # 204

My query returns records that fall into this case, but this is incorrect since the addresses are actually the same.

Would it be possible to get an example how this can be addressed?

The address value for both address 1, and address 2 are full blocks, not stored in multiple columns.

Dale K
  • 25,246
  • 15
  • 42
  • 71
erasmo carlos
  • 664
  • 5
  • 16
  • 37
  • 4
    Address manipulation is extremely problematic at best. You could use REPLACE in your example here to make this comparison. But in a case sensitive collation these are still not the same. And if they are not **exactly** the same other than Unit and # they won't match. – Sean Lange Nov 26 '19 at 17:43
  • There is a nice Wikipeadia article about this problem: https://en.wikipedia.org/wiki/Postal_address_verification – Piotr Palka Nov 26 '19 at 17:44

2 Answers2

0

Here is how you could use REPLACE for this. But even an extra space in either address means they won't match.

declare @Something table
(
    address1 varchar(50)
    , address2 varchar(50)
)

insert @Something values
('69 Cherokee Cir Unit 204', '69 CHEROKEE CIR # 204')

select *
from @Something s
where s.address1 = replace(s.address2, '#', 'Unit')
Sean Lange
  • 33,028
  • 3
  • 25
  • 40
  • Thank you Sean, this is a good starting point. I have been struggling to filter correctly the results that I am going crazy. – erasmo carlos Nov 26 '19 at 17:59
0

Here's a method to count the different words. But it uses the STRING_SPLIT function that was only added since Sql Server 2017.

select t.*
from YourTable t
cross apply 
(
  select
   count(nullif(s1.value,' ')) as diffcnt1
  ,count(nullif(s2.value,' ')) as diffcnt2
  --,string_agg(nullif(s1.value,' '),'|') as diff1,
  --,string_agg(nullif(s2.value,' '),'|') as diff2
  from string_split(t.address1,' ') s1
  full join string_split(t.address2,' ') s2
    on lower(s1.value) = lower(s2.value)
  where s1.value is null
     or s2.value is null
) ca
where t.address1 != t.address2
  and (ca.diffcnt1>1 or ca.diffcnt2>1)

A test on db<>fiddle here

Or try this quick & dirty method to filter at least some that are relatively similar.
By also checking the DIFFERENCE between the strings.

SELECT *
FROM YourTable
WHERE (LOWER(address1) != LOWER(address2)) 
  AND DIFFERENCE(LOWER(address1), LOWER(address2)) < 4

But DIFFERENCE just compares the SOUNDEX values of the strings. So there's a probability of getting false-positives or false-negatives.

Perhaps it's more accurate to check the Levenshtein distance between the strings.
Here's an old SO post about calculating that metric in Sql Server: Levenshtein distance in T-SQL

And there exist other string metrics.

LukStorms
  • 28,916
  • 5
  • 31
  • 45
  • Well, it just calculates a number between 0 and 4 to indicate the degree of similarity. And 4 indicates strongly similar. If he wants to find different strings, but avoid those that are to similar, then filtering those with difference 4 isn't really a perfect solution. But maybe the OP might find this concept still helpfull. – LukStorms Nov 26 '19 at 19:49
  • No problemo. A time ago I did the reverse, and read your answer wrong. It happens. ;) – LukStorms Nov 26 '19 at 20:40