0

I am working in SQL Server 2012. In my table, there is a column called St_Num and its data is like this:

St_Num                 status
------------------------------
128 TIMBER RUN DR      EXP
128 TIMBER RUN DRIVE   EXP

Now we can notice that there are spelling variations in the data above. What I would like to do is that if the number in this case 128 and first 3 letters in St_Num column are same then these both rows should be considered the same like this the output should be:

St_Num                status
-----------------------------
128 TIMBER RUN DR     EXP

I did some search regarding this and found that left or substring function can be handy here but I have no idea how they will be used here to get what I need and don't know even if they can solve my issue. Any help regarding how to get the desired output would be great.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
zain ul abidin
  • 197
  • 2
  • 13
  • 2
    Standardizing Addresses is more involved than one may think. Take a peek at https://stackoverflow.com/questions/41249742/address-standardization-within-a-database/41249971#41249971 – John Cappelletti Mar 10 '19 at 19:38

5 Answers5

1

This could possibly be done by using a subquery in the same way that you would eliminate duplicates in a table so:

SELECT Str_Num, status
FROM <your_table> a
WHERE NOT EXISTS (SELECT 1
FROM <your_table> b
WHERE SUBSTRING(b.Str_Num, 1, 7) = SUBSTRING(a.Str_Num, 1, 7));

This would only work however if the number is guaranteed to be 3 characters long, or if you don't mind it taking more characters in the case that the number is fewer characters.

GeekKat
  • 24
  • 2
1

This will output only the first of the matching rows:

with cte as (
  select *,
  row_number() over (order by (select null)) rn
  from tablename
)
select St_Num, status from cte t
where not exists (
  select 1 from cte
  where 
    left(St_Num, 7) = left(t.St_Num, 7)
    and
    rn < t.rn
)

See the demo

forpas
  • 160,666
  • 10
  • 38
  • 76
  • This will fail if there is a "128 TIMBER HILL" or other resembling names or if there is a spelling error like "128 TMBER RUN DR" – Olivier Jacot-Descombes Mar 10 '19 at 20:02
  • *if the number in this case 128 and first 3 letters in St_Num column are same then these both rows should be considered the same* this is the requirement. – forpas Mar 10 '19 at 20:04
  • @zain ul abidin: If this requirement is only used to return data, it's okay, but it will lead to a catastrophe if you use it in an UPDATE statement to standardize addresses. – Olivier Jacot-Descombes Mar 10 '19 at 20:09
  • @forpas what i need is that when first time the query goes through 123 TIMBER RUN DR and then when in the St_Num column it agains goes thorugh a field which is 123 TIMBER RUN DRIVE then it should check that if 123 TIM are matching then do not display this record but show the first one only. – zain ul abidin Mar 10 '19 at 20:23
  • @zainulabidin Is there a column like an id (integer autoincrement) that would serve as the order of rows? – forpas Mar 10 '19 at 20:27
  • @forpas no sorry there is no ID column but how that will serve the purpose here, please? – zain ul abidin Mar 10 '19 at 20:39
  • If there is an integer id then the order is obvious and the row which will be chosen as output among the matching ones is the one with the minimum id. – forpas Mar 10 '19 at 20:41
1

You can use grouping by status and substring(St_Num,1,3)

with t(St_Num, status) as
(
 select '128 TIMBER RUN DR'   ,'EXP' union all
 select '128 TIMBER RUN DRIVE','EXP'   
) 
select min(St_Num) as St_Num, status
  from t    
 group by status, substring(St_Num,1,3);

St_Num              status
-----------------   ------
128 TIMBER RUN DR   EXP
Barbaros Özhan
  • 59,113
  • 10
  • 31
  • 55
1

I don't really approve of your matching logic . . . but that is not your question. The big issue is how long is the number before the string. So, you can get the shortest of the addresses using:

select distinct t.*
from t
where not exists (select 1
                  from t t2
                  where left(t2.st_num, patindex('%[a-zA-Z]%') + 2, t.st_num) = left(t.st_num, patindex('%[a-zA-Z]%', t.st_num) + 2) and
                        len(t.St_Num) < len(t2.St_Num)
                 );
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

I still have odd feeling that your criteria is not enough to match same addresses but this might help, since it considers also length of the number:

WITH ParsedAddresses(st_num, exp, number)
AS
(
    SELECT st_num,
    exp,
    number = ROW_NUMBER() OVER(PARTITION BY LEFT(st_num, CHARINDEX(' ', st_num) + 3) ORDER BY LEN(st_num))
    FROM <table_name>
)
SELECT st_num, exp FROM ParsedAddresses
WHERE number = 1
Emil
  • 306
  • 5
  • 16