1

I have a dataset that has three fields: an address, the numbers stripped from the address, and the letters stripped from the address.

IF OBJECT_ID ('tempdb..#addresses') IS NOT NULL
DROP TABLE #addresses

create table #addresses (
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100)
)

insert into #addresses
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY')

I would like to find groups of addresses by similarity within the same stripped numbers. I know how to find the similarity between two text strings...

select *
from #addresses a
left outer join #addresses b on a.address_numbers = b.address_numbers and MDS_DB.MDQ.SIMILARITY(a.address_letters ,b.address_letters , 2, 0, .90) >= .90

...but I'm not sure how to assign an exemplar/grouping code to each address in the original data. The desired results looks like this:

IF OBJECT_ID ('tempdb..#addresses_desired_result') IS NOT NULL
DROP TABLE #addresses_desired_result

create table #addresses_desired_result (
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100),
    address_group varchar(100)
)

insert into #addresses_desired_result
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY', '123 Something Rd, Somewhere NY 45678'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY', '123 Something Rd, Somewhere NY 45678'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY', '234 Something Road, Somewhere NY 45678')

select *
from #addresses_desired_result

address_group could be one of the addresses within the group, or it could just be an integer. The goal is to join the distinct list of addresses and exemplars back to a much larger transaction table and group records by the exemplar/group number.

How do I assign an exemplar address/group number to each group of similar addresses within the same stripped numbers?

OverflowingTheGlass
  • 2,324
  • 1
  • 27
  • 75
  • Address standardization is a slippery slope. This is why I use the Google API. The initial bulk can be time consuming depending on the volume of data, but the results is worth it. Take a peek at https://stackoverflow.com/questions/41249742/address-standardization-within-a-database/41249971#41249971 – John Cappelletti Nov 07 '19 at 14:23
  • Yes - aware of other address standardization methods. For this purpose, I'm fine with the lower quality results of just grouping addresses that are 90%+ similar within the same stripped address numbers. Any thoughts on the join logic that would be necessary to do something like this? – OverflowingTheGlass Nov 07 '19 at 14:27
  • You obviously have the full address in the address_desired_result. Could you not do a Min() OVER() from the parsed out data (partitioning by address_numbers and address_letters) to get your grouping value? You can use the order by to set your formatting preferences. – JJ32 Nov 07 '19 at 14:36
  • Then it becomes a matter of normalizing the words. In the past, I've created a table where (for example) ROAD becomes RD, STREET becomes ST. I've also used a ZIP Code database to validate and normalize City, State, Zip. As you know, cities and towns may have multiple names. For example East Providence is also know as Riverside. – John Cappelletti Nov 07 '19 at 14:36
  • @JJ32 If anything max() not min(). Also the address_numbers is risky. Zip+4 is very inconsistent and would dramatically alter the key. – John Cappelletti Nov 07 '19 at 14:40
  • @JJ32 do you mind writing a code snippet for that? having trouble visualizing. I wouldn't want it actually partitioned out by characters, because this would not group similar addresses (except those that have special characters like a comma, period, etc.). Also open to not partitioning by numbers either and just grouping by full address similarity – OverflowingTheGlass Nov 07 '19 at 14:48
  • @JohnCappelletti - totally agree that there are better ways to do this. unfortunately, constraints are pushing me towards just grouping by similarity without worrying about normalizing names – OverflowingTheGlass Nov 07 '19 at 14:48
  • If you are trying to format the addresses, I would follow other advice such as @JohnCappelletti provided. If all you need is an arbitrary link there are many examples of window functions on StackOverflow (hopefully you can partition by the similarity function); the order by gives you some measure of control over how the grouping should appear. – JJ32 Nov 07 '19 at 14:53
  • You are going to have to normalize names. If you don't then you need to define similar. Sure the human brain can see that "123 Something Rd" and "123 Something Road" is the same thing. But the computer cannot see that easily. And if you get to loose in your comparison then you might get false positives. – Sean Lange Nov 07 '19 at 14:55
  • @JJ32 As I mentioned, I'm not trying to reformat the addresses - I am trying to group them by similarity and then arbitrarily pick one member of the group to represent the group. I know the window functions row_number(), dense_rank(), etc. may help, but I'm not sure how to apply them in this case to get the exemplar address. – OverflowingTheGlass Nov 07 '19 at 14:55
  • @SeanLange - agreed that normalizing would help enormously. however, I'm fine with taking the negatives of a similarity grouping approach due to time/resource constraints. normalizing/geocoding are in the works, but won't help solve my current problem – OverflowingTheGlass Nov 07 '19 at 14:57
  • Also, I could see this having application for strings that aren't addresses that could not be normalized through the methods discussed here. – OverflowingTheGlass Nov 07 '19 at 14:57
  • OK so since you aren't going to do it right what is considered "good enough"? Maybe look at SOUNDEX? It is hard to offer any real help here when the solutions that are accurate are countered with "I don't have time". – Sean Lange Nov 07 '19 at 14:59
  • I'm saying doing it the "right" way is in the works. This is not a personal project, so I don't determine how much time I have. In order to be pragmatic and get some usable results, I identified this path as a possibility. It is an improvement (I think) over the in-place process of simply grouping by the full address with no attempt at combining addresses. I've already laid out in my question the method that is "good enough" for me, with sample input/result data. – OverflowingTheGlass Nov 07 '19 at 15:12
  • Why not just pick one of the two? You already determined how you are going to decide that they are close enough. You will need some way to specify the order of your rows but that shouldn't be a big deal. – Sean Lange Nov 07 '19 at 15:22
  • @SeanLange - That's the crux of my question. Determining an exemplar for a group of similar text strings, possibly partitioned by another field (in this case the stripped numbers). Apologies if it's too simple of a problem to ask here, but it has stumped me for a couple days. Also, I'm not married to my idea - open to other suggestions that can stay within regular SQL and the input data I provided. – OverflowingTheGlass Nov 07 '19 at 15:24
  • It isn't that the problem is too simple. The confusion is that the problem is not clear. But I think I understand what you are after. – Sean Lange Nov 07 '19 at 15:30

1 Answers1

1

To get clarification:

IF OBJECT_ID ('tempdb..#addresses') IS NOT NULL
DROP TABLE #addresses

create table #addresses (
    id int identity(1,1),
    address_numbers varchar(50),
    address_all varchar(100),
    address_letters varchar(100)
)

insert into #addresses
values ('12345678','123 Something Rd, Somewhere NY 45678', 'SOMETHINGRDSOMEWHERENY'),
       ('12345678','123 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY'),
       ('23445678','234 Something Road, Somewhere NY 45678', 'SOMETHINGROADSOMEWHERENY')


select A.address_numbers, A.address_all, A.address_letters, 
  isnull(B.address_all, A.address_all) as address_group
from #addresses A
left join
(
select A.id, B.address_all,
  row_number() over(order by case when B.address_all + ' ' like '% rd %' then 1 when B.address_all + ' ' like '% road %' then 2 end,
    case when B.address_all + ' ' like '% st %' then 1 when B.address_all + ' ' like '% street %' then 2 end) AS RowNr
from #addresses A
  cross join #addresses B
  where left(A.address_all, 5) = left(b.address_all, 5)  --place similarity function here
   and A.id <> B.id
) B on A.id = B.id and B.RowNr = 1

results:

address_numbers address_all                             address_letters             address_group
12345678        123 Something Rd, Somewhere NY 45678    SOMETHINGRDSOMEWHERENY      123 Something Rd, Somewhere NY 45678
12345678        123 Something Road, Somewhere NY 45678  SOMETHINGROADSOMEWHERENY    123 Something Rd, Somewhere NY 45678
23445678        234 Something Road, Somewhere NY 45678  SOMETHINGROADSOMEWHERENY    234 Something Road, Somewhere NY 45678   

I used left(address_all, 5) in place of similarity function, but you can perform any calculation you like.

JJ32
  • 1,034
  • 1
  • 7
  • 24
  • Thank you - I'm looking for any address that has 90% similarity or more with each other AND the same address_numbers to be put into one group. I think this is more in the right direction than I was going, but I'm not sure how to add the similarity portion in, or avoid hard-coding some of the normalization you inserted into the CASE statement. – OverflowingTheGlass Nov 07 '19 at 15:42
  • I see why this is difficult, I think you have to first group together addresses based on if they are similar. Then assign an arbitrary code to that (maybe min(address_all). Then, partition on that. Sorry if the my understanding of the problem is not clear. – JJ32 Nov 07 '19 at 15:43
  • Yep - that describes it pretty well. I've been stumped by how to do that initial grouping. – OverflowingTheGlass Nov 07 '19 at 15:44
  • Bear with me, I think part of the problem is not all addresses will have a similarity, and therefore nothing to "group" on. I tried to make the test for similarity separate, then return one value (as the group) based on sorting preferences. – JJ32 Nov 07 '19 at 16:01
  • Right - most will probably not have a similarity - so for those, that one address itself could be the group. – OverflowingTheGlass Nov 07 '19 at 16:02
  • You might try and see, it will give you very crude values but if all you need is a match this might work for you. – JJ32 Nov 07 '19 at 16:03
  • So this only looks at road/rd and street/st, right? it doesn't incorporate the similarity function at all? appreciate the help either way – OverflowingTheGlass Nov 07 '19 at 16:06
  • 1
    I used left(address_all, 5) in place of the similarity function. – JJ32 Nov 07 '19 at 16:09
  • makes sense - one last question - is there a way to replace the hard coded road/rd and street/st with something that's more dynamic? – OverflowingTheGlass Nov 07 '19 at 16:15
  • Also - this is using desired_results as input data which I think may be skewing the results – OverflowingTheGlass Nov 07 '19 at 16:56