Specific string matching

Question

I am working in SQL Server 2012. In my table, there is a column called St_Num and its data is like this:

St_Num                 status
------------------------------
128 TIMBER RUN DR      EXP
128 TIMBER RUN DRIVE   EXP

Now we can notice that there are spelling variations in the data above. What I would like to do is that if the number in this case 128 and first 3 letters in St_Num column are same then these both rows should be considered the same like this the output should be:

St_Num                status
-----------------------------
128 TIMBER RUN DR     EXP

I did some search regarding this and found that left or substring function can be handy here but I have no idea how they will be used here to get what I need and don't know even if they can solve my issue. Any help regarding how to get the desired output would be great.

Standardizing Addresses is more involved than one may think. Take a peek at https://stackoverflow.com/questions/41249742/address-standardization-within-a-database/41249971#41249971 — John Cappelletti, Mar 10 '19 at 19:38

score 1 · Answer 1 · answered Mar 10 '19 at 19:56

This could possibly be done by using a subquery in the same way that you would eliminate duplicates in a table so:

SELECT Str_Num, status
FROM <your_table> a
WHERE NOT EXISTS (SELECT 1
FROM <your_table> b
WHERE SUBSTRING(b.Str_Num, 1, 7) = SUBSTRING(a.Str_Num, 1, 7));

This would only work however if the number is guaranteed to be 3 characters long, or if you don't mind it taking more characters in the case that the number is fewer characters.

forpas · Accepted Answer · 2019-03-10T21:24:23.707

1

This will output only the first of the matching rows:

with cte as (
  select *,
  row_number() over (order by (select null)) rn
  from tablename
)
select St_Num, status from cte t
where not exists (
  select 1 from cte
  where 
    left(St_Num, 7) = left(t.St_Num, 7)
    and
    rn < t.rn
)

See the demo

edited Mar 10 '19 at 21:24

answered Mar 10 '19 at 19:58

forpas

160,666
10
38
76

This will fail if there is a "128 TIMBER HILL" or other resembling names or if there is a spelling error like "128 TMBER RUN DR" – Olivier Jacot-Descombes Mar 10 '19 at 20:02
*if the number in this case 128 and first 3 letters in St_Num column are same then these both rows should be considered the same* this is the requirement. – forpas Mar 10 '19 at 20:04
@zain ul abidin: If this requirement is only used to return data, it's okay, but it will lead to a catastrophe if you use it in an UPDATE statement to standardize addresses. – Olivier Jacot-Descombes Mar 10 '19 at 20:09
@forpas what i need is that when first time the query goes through 123 TIMBER RUN DR and then when in the St_Num column it agains goes thorugh a field which is 123 TIMBER RUN DRIVE then it should check that if 123 TIM are matching then do not display this record but show the first one only. – zain ul abidin Mar 10 '19 at 20:23
@zainulabidin Is there a column like an id (integer autoincrement) that would serve as the order of rows? – forpas Mar 10 '19 at 20:27
@forpas no sorry there is no ID column but how that will serve the purpose here, please? – zain ul abidin Mar 10 '19 at 20:39
If there is an integer id then the order is obvious and the row which will be chosen as output among the matching ones is the one with the minimum id. – forpas Mar 10 '19 at 20:41

Barbaros Özhan · Answer 3 · 2019-03-10T21:13:17.983

1

You can use grouping by status and substring(St_Num,1,3)

with t(St_Num, status) as
(
 select '128 TIMBER RUN DR'   ,'EXP' union all
 select '128 TIMBER RUN DRIVE','EXP'   
) 
select min(St_Num) as St_Num, status
  from t    
 group by status, substring(St_Num,1,3);

St_Num              status
-----------------   ------
128 TIMBER RUN DR   EXP

edited Mar 10 '19 at 21:13

answered Mar 10 '19 at 20:54

Barbaros Özhan

59,113
10
31
55

score 1 · Answer 4 · answered Mar 10 '19 at 21:16

I don't really approve of your matching logic . . . but that is not your question. The big issue is how long is the number before the string. So, you can get the shortest of the addresses using:

select distinct t.*
from t
where not exists (select 1
                  from t t2
                  where left(t2.st_num, patindex('%[a-zA-Z]%') + 2, t.st_num) = left(t.st_num, patindex('%[a-zA-Z]%', t.st_num) + 2) and
                        len(t.St_Num) < len(t2.St_Num)
                 );

Emil · Answer 5 · 2019-03-10T22:01:08.147

I still have odd feeling that your criteria is not enough to match same addresses but this might help, since it considers also length of the number:

WITH ParsedAddresses(st_num, exp, number)
AS
(
    SELECT st_num,
    exp,
    number = ROW_NUMBER() OVER(PARTITION BY LEFT(st_num, CHARINDEX(' ', st_num) + 3) ORDER BY LEN(st_num))
    FROM <table_name>
)
SELECT st_num, exp FROM ParsedAddresses
WHERE number = 1

Specific string matching

5 Answers5