0

enter image description here

I am learning SQL and need to write a code to identify potential duplicates in the column 'party_name' in the image. The matches not necessarily be 100%, I need to identify any % of matches between two or more entries in the party name. Can anyone please guide?

I found SOUNDEX() but it does not work .. or maybe I am not doing it correct.

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
esinmal
  • 3
  • 2
  • _"I found `SOUNDEX()` but it does not work"_ - Post the query that you tried using. Though `SOUNDEX` is only useful for short strings. That said, SQL is **the worst language for text-processing**. You're better-off doing this in Excel or something. – Dai Nov 04 '22 at 11:54
  • There are ways to implement things like [Levenshtein distance in T-SQL](https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql), however, they are often far from performant. Such tasks are *normally* better done outside of your SQL layer – Thom A Nov 04 '22 at 11:58
  • In regards to `SOUNDEX`, it is aimed at individual words, rather than phrases. `SOUNDEX('The Simpsons')` and `SOUNDEX('The Universe')` both return `'T000'` despite being completely different. – Thom A Nov 04 '22 at 12:01
  • @Larnu Thank you for your replies.. I was thinking the same that it would be better if I tried this with Excel.. but wanted to find out whether sql can do something that could be easier. – esinmal Nov 04 '22 at 13:25
  • @Dai Thank you for your replies.. I was thinking the same that it would be better if I tried this with Excel.. but wanted to find out whether sql can do something that could be easier. – esinmal Nov 04 '22 at 13:26
  • Percentage match won't help, either. For example, `HSBC` and `HSBC Private Bank (UK) Limited` look like dupes, but only match about 13%. To really do this well you need a set of real human eyes on these rows. Typically, that's best done by pushing out to the entry form in the **user interface**, by making users pick from a list at entry time, rather than typing or pasting in a full name. Another option is making a **mapping table**, so each candidate value currently in the table maps to a **canonical value**. Then the canonical values can match exactly. But again: the mapping needs real humans – Joel Coehoorn Nov 04 '22 at 16:14

1 Answers1

1

Providing your demo data as an object is really helpful, especially when you have long strings.

DECLARE @table TABLE (PartyID NVARCHAR(10), Party_Name NVARCHAR(50))
INSERT INTO @table (PartyID, Party_Name) VALUES
('K0164', 'Bank of China (Hong Kong) Ltd.'),
('GP'   , 'EFG Private Bank'),
('R0054', 'Star Magnolia Capital Limited'),
('E0054', 'Alternative Capital Investments SPC, Ltd.'),
('E0045', 'Alternativa Capital Ltd')

So it's important to note here that this will likely not be super performant, but as an exercise in "can I do that" it was kinda neat.

What we're doing here is breaking the strings down to single words (by splitting on the white space) and then using SOUNDEX to give them a value. Using that we can look for matches between the words on other PartyIDs. From this we can figure out a total of words that match and use that to determine a percentage of match. In the case of 'Alternativa Capital Ltd' compared to 'Alternative Capital Investments SPC, Ltd.' we found all three close-enough matches so it's 100%.

;WITH matchEmUp AS (
SELECT *, SOUNDEX(value) AS SndEx, ROW_NUMBER() OVER (PARTITION BY PartyID ORDER BY SOUNDEX(value)) AS WordID, COUNT(PartyID) OVER (PARTITION BY PartyID ORDER BY (SELECT 1)) AS Cnt
  FROM @table 
  CROSS APPLY STRING_SPLIT(Party_name, ' ')
), matchy AS (
SELECT a.PartyID AS ThisPartyID, b.PartyID AS OtherPartyID, COUNT(b.PartyID) OVER (PARTITION BY a.PartyID, B.PartyID) AS OtherCnt
  FROM matchEmUp a
    LEFT OUTER JOIN matchEmUp b
      ON a.PartyID <> b.PartyID
      AND a.SndEx = b.SndEx
)

SELECT ThisPartyID, OtherPartyID, MAX(Cnt+.0) AS ThisCnt, MAX(OtherCnt+.0) AS OtherCnt, MAX(OtherCnt+.0) / MAX(Cnt+.0) AS MatchPct
  FROM matchy my
    INNER JOIN matchEmUp meu
      ON my.ThisPartyID = meu.PartyID
 WHERE OtherPartyID IS NOT NULL
 GROUP BY ThisPartyID, OtherPartyID
ThisPartyID OtherPartyID    ThisCnt OtherCnt    MatchPct
----------------------------------------------------------------
K0164       E0045           6.0     1.0         0.16666666666666
E0054       R0054           5.0     1.0         0.20000000000000
E0045       E0054           3.0     3.0         1.00000000000000
R0054       E0045           4.0     1.0         0.25000000000000
E0045       K0164           3.0     1.0         0.33333333333333
E0054       E0045           5.0     3.0         0.60000000000000
E0045       R0054           3.0     1.0         0.33333333333333
GP          K0164           3.0     1.0         0.33333333333333
K0164       E0054           6.0     1.0         0.16666666666666
K0164       GP              6.0     1.0         0.16666666666666
R0054       E0054           4.0     1.0         0.25000000000000
E0054       K0164           5.0     1.0         0.20000000000000

Now you have to decide if the very fuzzy soundex matching is close enough for your needs or not.

Patrick Hurst
  • 2,086
  • 1
  • 3
  • 13