SQL Server comparing part of an address string, 2 columns

Question

I would like to ask for help regarding how to do a partial string comparison of two address values stored in two separate columns. I need to only identify the values that match even if the match is not 100%. I do not care for the value for the street direction: N, S, E, W, or whether it is a suite, apartment, nor the type of street (rd, st, dr, cr, etc). Perhaps the requirement is fulfilled by only matching the first value, the house number.

An example would be:

Column1          Column2
17 Wickham CT    17 S WICKHAM CT # 2 << This is a partial match, include
6818 Chester DR  6801 CHESTER DR # A << This is a partial match, include
6301 Raymond RD  6301 RAYMOND RD     << This is a full match, include
6217 Raymond RD  PO BOX 45581        << This doesn't match, don't include

I have the query that retrieves everything, I just need to figure out how to filter the records that I don't need.

If I could get any help on this, it would be awesome.

Thank you.

UPDATE

I think I am understanding the data a bit better. I hope I can get help for the following context.

OwnerID     Addr_Nbr    Address_A           Address_B
3336223     2204138     11 Westbrook CIR    11 WESTBROOK CIR
3336223     2431628     9 Westbrook CIR     11 WESTBROOK CIR
3337465     2328720     4214 School RD      4214 SCHOOL RD

In that scenario, from the first two records the first one seem to be a match, but it is not, because that owner has more than one address, the third record is a good match because the address matches and the owner only has one address. How can I select only the records where the address matches, and there is only one address number for the owner? I hope this makes sense.

Thank you again

That is pretty custom. I think you need C# or another programming language. I bet you could do it in TSQL cursor but it would be a pain. Levenschtein distance my do what you are looking for. — paparazzo, Jun 14 '18 at 17:31

score 1 · Answer 1 · answered Jun 14 '18 at 16:23

1

Try using Levenshtein distance. This link gives a few examples of how to implement it using T-SQL.

Levenshtein distance basically gives you a value for the differences between two strings. You can come up with an acceptable value for your query and return any that hits that acceptable threshold.

answered Jun 14 '18 at 16:23

Nerrickk

49
1
8

Thank you, I looked at the article you included, but I am lost as to how to implemented – erasmo carlos Jun 14 '18 at 17:39
Create the edit_distance_within function, then pass in your two values. The third is a threshold that you need to determine is viable for a match (10 is an arbitrary number, you'll need to tweak it yourself). `select edit_distance_within(Column1, Column2, 10)`17 S WICKHAM CT # 2 17 **WICKHAM CT**** It would take 6 edits to match, therefore the levenshtein distance is 6 for this example. Each asterisk in this case is a delete. Your last example has a distance of 14 because almost every character needs to be replaced. – Nerrickk Jun 14 '18 at 18:32

score 0 · Accepted Answer · answered Jun 14 '18 at 18:51

You can also try using Pearson's correlation coefficient

Perfect example for MS SQL is here.

Here is my quick assembled example:

--helper function to convert string to ASCII column
CREATE FUNCTION dbo.fn_StringToASCII
(
    @text VARCHAR(MAX)
)
RETURNS @Result TABLE
(
    POS INT,
    NUM INT
)
AS
BEGIN
    DECLARE @i INT
    SET @i = 1
    WHILE @i <= LEN(@text)
    BEGIN
        INSERT INTO @Result
        (
            POS,
            NUM
        )
        VALUES
        (@i, ASCII(SUBSTRING(@text, @i, 1)))
        SET @i = @i + 1
    END
    RETURN;
END;

-- test example
 CREATE TABLE test1(ID INT, ADDR1 VARCHAR(20));
 CREATE TABLE test2(ID INT, ADDR2 VARCHAR(20));

 INSERT INTO dbo.test1
(
    ID,
    ADDR1
)
VALUES
(1, '17 Wickham CT'),
(2, '6818 Chester DR'),
(3, '6217 Raymond RD'),
(4, 'TEST');

INSERT INTO dbo.test2
(
    ID,
    ADDR2
)
VALUES
(1, '17 S WICKHAM CT # 2'),
(2, '6801 CHESTER DR # A'),
(3, 'PO BOX 45581'),
(4, 'TEST');

 --query with coeff
SELECT ISNULL(t1.ID, c2.ID) AS ID,
       (AVG(c1.NUM * c2.NUM) - (AVG(c1.NUM) * AVG(c2.NUM))) / (STDEVP(c1.NUM) * STDEVP(c2.NUM)) AS  Coeff
FROM dbo.test1 t1
    CROSS APPLY dbo.fn_StringToASCII(LOWER(t1.ADDR1)) c1
    RIGHT JOIN
    (
        SELECT t2.ID,
               c2.*
        FROM dbo.test2 t2
            CROSS APPLY dbo.fn_StringToASCII(LOWER(t2.ADDR2)) c2
    ) c2
        ON c2.ID = t1.ID
           AND c2.POS = c1.POS
WHERE 1 = 1
GROUP BY ISNULL(t1.ID, c2.ID);


DROP TABLE dbo.test1
DROP TABLE dbo.test2

/*
results
ID  Coeff
1   0.957280794307261
2   1.58310202187124
3   -0.397204343866094
4   0.987654320987654
*/

SQL Server comparing part of an address string, 2 columns

2 Answers2