0

I would like to ask for help regarding how to do a partial string comparison of two address values stored in two separate columns. I need to only identify the values that match even if the match is not 100%. I do not care for the value for the street direction: N, S, E, W, or whether it is a suite, apartment, nor the type of street (rd, st, dr, cr, etc). Perhaps the requirement is fulfilled by only matching the first value, the house number.

An example would be:

Column1          Column2
17 Wickham CT    17 S WICKHAM CT # 2 << This is a partial match, include
6818 Chester DR  6801 CHESTER DR # A << This is a partial match, include
6301 Raymond RD  6301 RAYMOND RD     << This is a full match, include
6217 Raymond RD  PO BOX 45581        << This doesn't match, don't include

I have the query that retrieves everything, I just need to figure out how to filter the records that I don't need.

If I could get any help on this, it would be awesome.

Thank you.

UPDATE

I think I am understanding the data a bit better. I hope I can get help for the following context.

OwnerID     Addr_Nbr    Address_A           Address_B
3336223     2204138     11 Westbrook CIR    11 WESTBROOK CIR
3336223     2431628     9 Westbrook CIR     11 WESTBROOK CIR
3337465     2328720     4214 School RD      4214 SCHOOL RD  

In that scenario, from the first two records the first one seem to be a match, but it is not, because that owner has more than one address, the third record is a good match because the address matches and the owner only has one address. How can I select only the records where the address matches, and there is only one address number for the owner? I hope this makes sense.

Thank you again

erasmo carlos
  • 664
  • 5
  • 16
  • 37
  • That is pretty custom. I think you need C# or another programming language. I bet you could do it in TSQL cursor but it would be a pain. Levenschtein distance my do what you are looking for. – paparazzo Jun 14 '18 at 17:31

2 Answers2

1

Try using Levenshtein distance. This link gives a few examples of how to implement it using T-SQL.

Levenshtein distance basically gives you a value for the differences between two strings. You can come up with an acceptable value for your query and return any that hits that acceptable threshold.

Nerrickk
  • 49
  • 1
  • 8
  • Thank you, I looked at the article you included, but I am lost as to how to implemented – erasmo carlos Jun 14 '18 at 17:39
  • Create the edit_distance_within function, then pass in your two values. The third is a threshold that you need to determine is viable for a match (10 is an arbitrary number, you'll need to tweak it yourself). `select edit_distance_within(Column1, Column2, 10)`17 S WICKHAM CT # 2 17 **WICKHAM CT**** It would take 6 edits to match, therefore the levenshtein distance is 6 for this example. Each asterisk in this case is a delete. Your last example has a distance of 14 because almost every character needs to be replaced. – Nerrickk Jun 14 '18 at 18:32
0

You can also try using Pearson's correlation coefficient

Perfect example for MS SQL is here.

Here is my quick assembled example:

--helper function to convert string to ASCII column
CREATE FUNCTION dbo.fn_StringToASCII
(
    @text VARCHAR(MAX)
)
RETURNS @Result TABLE
(
    POS INT,
    NUM INT
)
AS
BEGIN
    DECLARE @i INT
    SET @i = 1
    WHILE @i <= LEN(@text)
    BEGIN
        INSERT INTO @Result
        (
            POS,
            NUM
        )
        VALUES
        (@i, ASCII(SUBSTRING(@text, @i, 1)))
        SET @i = @i + 1
    END
    RETURN;
END;

-- test example
 CREATE TABLE test1(ID INT, ADDR1 VARCHAR(20));
 CREATE TABLE test2(ID INT, ADDR2 VARCHAR(20));

 INSERT INTO dbo.test1
(
    ID,
    ADDR1
)
VALUES
(1, '17 Wickham CT'),
(2, '6818 Chester DR'),
(3, '6217 Raymond RD'),
(4, 'TEST');

INSERT INTO dbo.test2
(
    ID,
    ADDR2
)
VALUES
(1, '17 S WICKHAM CT # 2'),
(2, '6801 CHESTER DR # A'),
(3, 'PO BOX 45581'),
(4, 'TEST');

 --query with coeff
SELECT ISNULL(t1.ID, c2.ID) AS ID,
       (AVG(c1.NUM * c2.NUM) - (AVG(c1.NUM) * AVG(c2.NUM))) / (STDEVP(c1.NUM) * STDEVP(c2.NUM)) AS  Coeff
FROM dbo.test1 t1
    CROSS APPLY dbo.fn_StringToASCII(LOWER(t1.ADDR1)) c1
    RIGHT JOIN
    (
        SELECT t2.ID,
               c2.*
        FROM dbo.test2 t2
            CROSS APPLY dbo.fn_StringToASCII(LOWER(t2.ADDR2)) c2
    ) c2
        ON c2.ID = t1.ID
           AND c2.POS = c1.POS
WHERE 1 = 1
GROUP BY ISNULL(t1.ID, c2.ID);


DROP TABLE dbo.test1
DROP TABLE dbo.test2

/*
results
ID  Coeff
1   0.957280794307261
2   1.58310202187124
3   -0.397204343866094
4   0.987654320987654
*/
Serge Makarov
  • 351
  • 2
  • 7