0

My current SQL query is:

Select 
A.CFCIF#, B.CIFNO, B.SNAME, A.CFSNME
From dbo.tbl_CIF_Master A
Left Join dbo.tbl_Loan_Master B
On A.CFCIF# = B.CIFNO
Where 
B.[STATUS] not in (2,8) 
--and CONTAINS (A.CFSNME or FORMSOF (THESAURUS, B.SNAME)) --doesn't work. 
  I'm not an admin so I can't design thesaurus mappings
and B.SNAME LIKE '%' + A.CFSNME + '%'  -- works but no results which can't 
  be accurate

This runs fine but finds no differences as I noted using LIKE. The line using THESAURUS is commented out as I noted... An example of the subtle differences I would find in the two 'name' fields SNAME & CFSNME would be subtly differences like missing a comma before LLC or Robert abbreviated to Rob.

Craig
  • 33
  • 2
  • 7
  • Thanks for your insight. I'm using Microsoft SQL Server 2012 R2 – Craig Sep 08 '17 at 20:36
  • would you consider one of your columns correct over the other? – PreQL Sep 12 '17 at 18:55
  • Either column could be incorrect in any instance, so I have to identify and document all differences. I only have Management Studio / SQL without the ability to create functions to develop the "answer." Appears to "suck" as there appear to multiple "programming" solutions. – Craig Sep 14 '17 at 15:26
  • There's no straightforward generic way to perform this that I can think of. It will depend on your data as to how complex your solution would be. For example, does Richard or Margaret get shortened to Dick or Peggy? What about LTD and Limited? I would consider first pulling out a distinct set of `SNAME` and `CFSNME` by left joining on where 85% of `SNAME` is in `CFSNME` and vice-versa (where its more than like 5 characters). Take a look at the matches and iteratively build up the logic for more robust matching but that assumes many, many things about your data. How big is your data? – PreQL Sep 14 '17 at 17:23
  • Each column only has roughly 13,000 rows. My sampling so far proves the differences will be far more subtle which perhaps makes this difficult. No period after a middle initial like Parsons, Micah M vs Parsons, Micah M. Ditto for business customers - perhaps a missing comma before "LLC" like ABC Materials LLC which could also show in the other column as ABC Materials, LLC. – Craig Sep 14 '17 at 18:12
  • well...thinking on it, do you need 'subtle' differences? Since you have a `CIFNO` then you know that the names should match, so just change to `and B.SNAME <> A.CFSNME` – PreQL Sep 15 '17 at 17:47

1 Answers1

0

Given the indefinite nature of the differences that you are looking for (which the strict substring matching that you have implemented will not catch), you might consider calculating a similarity metric between the columns and then determine an appropriate critical value of that similarity metric to identify strings that are the same but for subtle differences. See A better similarity ranking algorithm for variable length strings for a similarity metric that you might want to use.

rd_nielsen
  • 2,407
  • 2
  • 11
  • 18