-1

Table 1

SchoolId_1   Name

Table 2

SchoolId_2   Name

SchooldId_1 and SchooldId_2 have different values. Name have similar but not the same values.

In Table 1, I got name column value 'University of Maine -- Farmington'. In Table 2, I have 'University of Maine at Farmington'. I want to join these but sql skips them because they are not exactly the same.

Table1

University of Strathclyde

University of Maine Fort Kent

Bryn Athyn College of the New Church

Adirondack Community College

Allen Community College

Nova Scotia Community College

Unity College

Allen College

Table2

University of Strathclyde -- Glasgow

University of Maine at Fort Kent

Bryn Athyn College

Unity College

Allen College

I want to join them using the name column with values stated above and the results will be like

SchoolId_1    Name    SchoolId_2
1 University of Strathclyde 1101

2 University of Maine Fort Kent 1102

3 Bryn Athyn College of the New Church 1103

4 Adirondack Community College NULL

5 Allen Community College NULL

6 Nova Scotia Community College NULL

7 Unity College 1104

8 Allen College 1105

I tried LIKE, CHARINDEX but they require Name columns to have exactly the same values.

Current code (but only returns exactly the same name values):

SELECT
    dsa.SchoolId_1, dsa.Name, ds.SchoolId_2, ds.sch_address, ds.city, ds.sch_state 
INTO 
    match_sch 
FROM 
    dbo.sa_schools dsa 
LEFT JOIN 
    dbo.school ds ON dsa.Name = ds.Name;
  • If you want to have a equality between two string with "kind of' logic prefer using `dsa.Name LIKE '%'+ ds.Name +'%'` – Max Nov 13 '18 at 16:45
  • 1
    This is referred to as *fuzzy matching*; but we're probably going to need a few more examples of values from the two tables that you want to be matched, so that we can provide a suitable answer. Please [edit] your question to add a few more examples that you would expect to match. – Richardissimo Nov 13 '18 at 18:13
  • @Richardissimo I edited my question and provided more values. Hope it will be clearer but let me know if you need more – Eirin Gonzales Nov 14 '18 at 04:38
  • @Max I did what you suggested but It maps my data like Nova Scotia Community College to Unity College. But actually, it would really map because of 'unity College'. Still, thank you! – Eirin Gonzales Nov 14 '18 at 04:41
  • "join them using the name column with those kinds of values" is not clear. You have not told us exactly what result table should be given for any input tables. So how can we help you? (Rhetorical.) Please use enough words & sentences to clearly say what you mean. See [ask]--what have you learned researching such "fuzzy" condtions? PS Please read & act on [mcve]. – philipxy Nov 14 '18 at 05:19
  • I think an approach using the [Levenshtein distance](https://en.m.wikipedia.org/wiki/Levenshtein_distance) would work well with this kind of data. It's described in [this answer](https://stackoverflow.com/a/9731894/5198140). The speed isn't great with huge data sets, but fuzzy matching is complex, so you just need to avoid doing it too often. For example, at the point where the unofficial name is inserted, you could do a fuzzy match and store the id from the other table at that point (if there was one). Then you can just join based on the id. But worry about that after you've tried it. – Richardissimo Nov 14 '18 at 06:37
  • @Richardissimo Thanks for this! I'll read and try that first. – Eirin Gonzales Nov 14 '18 at 07:26

1 Answers1

2

Fantastic question! I have this problem constantly thanks to user-generated data on freeform text fields.

My solution was to build a UDF to use python to fix this. My UDF uses python and specifically the fuzzywuzzy library. I pass in two strings and it returns a numeric score which can be used for filtering.

eg

SELECT
    t1.schoolid_1
    ,t2.schoolid_2
    ,t1.name    as name_1
    ,t2.name    as name_2
FROM
    table_1 t1
LEFT JOIN
    table_2 t2
    ON
        my_udf(t1.name, t2.name) > 70

Although it looks like you're running on SQL Server, so python might not be available. In that case you could try to hack together something similar in a SQL-based UDF.

eg

SELECT
    1.0 * a.matches / a.rows as match_score
FROM
    (SELECT
        count(1) as rows
        ,count(CASE WHEN t1 is not NULL and t2 is not NULL THEN 1 END) as matches
    FROM
        (SELECT value FROM string_split(t1.name, '1')
        ) t1
    FULL OUTER JOIN
        (SELECT value FROM string_split(t2.name, '1')
        ) t2
        ON
            t1.value = t2.value
    ) a
ScottieB
  • 3,958
  • 6
  • 42
  • 60