-1

I am trying to do fuzzy string matching to get as more matches as I can. First I execute the Levenshtein Distance Algorithm (http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx) and store it as "distance" in my dbo.

My first table (t1) looks like this:

Name | Synonym
 A   |   A1
 A   |   A2
 A   |   A3
 B   |   B1
 B   |   B2

My second table (t2) looks like this: The ID field may look like Name and Synonym very much

 ID  | Description
 A   |    XXX
 B   |    YYY

My goal is to make left joins either on the Name or its Synonyms when the distance between 2 strings from each table (t1 and t2) are smaller than 2.

Here is my current work:

SELECT * 
FROM (
    SELECT t2.ID, ROW_NUMBER() over (partition by ID order by ID) as rn
    FROM table1 as t1
    LEFT JOIN table2 as t2
    ON (upper(trim(t1.Name)) = upper(trim(t2.ID)) OR upper(trim(t1.Synonym)) = upper(trim(t2.ID)))
    WHERE (dbo.distance(t1.Name,t2.ID)<=2 OR dbo.distance(t1.Synonym,t2.ID)<=2)
) temp
WHERE rn=1

Ideally, as long as their distance is smaller than 2, we will still doing the join.

It should get more matches by adding that condition, however it doesn't.

Am I missing anything here?


I was wondering if my problem is coming from this:

My intention is to see if the conditions meet, if so then just do the join. But my code here probably tells SQL to "join first", and the filter it afterwards.

Is there a way to let it see if the condition qualifies and then do the join "after"?

David
  • 63
  • 5
  • 2
    Expected results for the given sample data would help. – Dale K May 05 '20 at 03:03
  • Without that where clause, I get 3,824 matches. After adding that where clause, I only get 3,715. – David May 05 '20 at 03:05
  • 2
    And what are you expecting? Identify some rows which should be coming out and aren't (or vice versa) and check/post the data. – Dale K May 05 '20 at 03:25
  • I think I found the problem, see my edits. – David May 05 '20 at 03:46
  • 1
    Where does ID come into picture ? you have not defined table2 – Venkataraman R May 05 '20 at 03:52
  • Thanks for reminding, I have just added that – David May 05 '20 at 04:00
  • Does this answer your question? [Left Outer Join doesn't return all rows from my left table?](https://stackoverflow.com/questions/4707673/left-outer-join-doesnt-return-all-rows-from-my-left-table) – philipxy May 05 '20 at 04:13
  • 1
    Learn what LEFT JOIN ON returns: INNER JOIN ON rows UNION ALL unmatched left table rows extended by NULLs. Always know what INNER JOIN ON you want as part of an OUTER JOIN ON. After a LEFT JOIN ON, a WHERE or INNER JOIN that requires a right [sic] table column to be not NULL removes any rows with introduced NULLs, ie leaves only INNER JOIN ON rows, ie "turns OUTER JOIN into INNER JOIN". You have that. – philipxy May 05 '20 at 04:14
  • 1
    Please in code questions give a [mre]--cut & paste & runnable code, including smallest representative example input as code; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. Give the least code you can give that is code that you show is OK extended by code that you show is not OK. (Debugging fundamental.) For SQL that includes DBMS & DDL (including constraints & indexes) & input as code formatted as a table. [ask] Isolate the 1st subexpression not returning what you expect & say what you expect & why. (Debugging fundamental.) – philipxy May 05 '20 at 04:18
  • 1
    _Is there a way to let it see if the condition qualifies and then do the join "after"_. The only way to work out if the join qualifies is to compare columns in different tables. That is.... a _join_. To simplify a prior comment: when you do a left join, but then filter (`where`) on a column the result, you effectively make it an inner join. – Nick.Mc May 05 '20 at 04:22
  • 1
    You really need to read an introduction to querying in SQL. One does not conditionally join. One says which rows of a join one wants to keep. `x inner join y on c` is `x cross join y where c` ie `(x.c1,...,y.d1,...)` rows where `(x.c1,...) IN x AND (y.d1,...) IN y AND c`. So for each pair of x-y rows, when do you keep it? That's c. A table--base or query result--holds rows satisfying a criterion. INNER JOIN, ON & WHERE all become AND in a query result criterion. [Re querying.](https://stackoverflow.com/a/33952141/3404097) [Re INNER vs CROSS JOIN.](https://stackoverflow.com/a/25957600/3404097) – philipxy May 05 '20 at 04:39

1 Answers1

0

I have tried DIFFERENCE function just for demo purpose. It finds if two strings are similar and then returns 4 (lowest possible difference) and goes down to 0(lowest possible difference). You can try similar logic using your distance function.

DECLARE @table1 TABLE(Name varchar(10), synon varchar(10))

DECLARE @table2 TABLE(Name varchar(10), synon varchar(10))

INSERT INTO @table1
VALUES ('A','A1'),('A','A2'),('A','A3'),('B','B1'),('B','B2'),('B','B3')

INSERT INTO @table2
VALUES ('A','A1'),('A','A2'),('C','C1'),('B','B2'),('B','B3')


SELECT t1.name, t1.synon, t2.Name,t2.synon
FROM @table1 as t1
CROSS APPLY (SELECT T2.Name, t2.synon FROM @table2 as t2 WHERE DIFFERENCE(t2.Name,t1.Name) = 4 OR DIFFERENCE(t2.synon,t1.synon) = 4) as t2
+------+-------+------+-------+
| name | synon | Name | synon |
+------+-------+------+-------+
| A    | A1    | A    | A1    |
| A    | A2    | A    | A1    |
| A    | A3    | A    | A1    |
| A    | A1    | A    | A2    |
| A    | A2    | A    | A2    |
| A    | A3    | A    | A2    |
| B    | B1    | B    | B2    |
| B    | B2    | B    | B2    |
| B    | B3    | B    | B2    |
| B    | B1    | B    | B3    |
| B    | B2    | B    | B3    |
| B    | B3    | B    | B3    |
+------+-------+------+-------+
Venkataraman R
  • 12,181
  • 2
  • 31
  • 58