2

I have a process that creates a list of possible duplicate companies. The problem is that "The ABC Company, Inc." and "ABC Company, Inc." both in Dallas, TX are probably duplicates but I won't find them with my criteria. I've eliminated the first 4 characters if they are "the " but I also need to check for the right 5 characters if they are " Inc.".

I have a view that creates a column thename. The prefix "the " has been stripped;

    SELECT     CASE WHEN LEFT(name, 4) = 'The ' THEN RIGHT(name, (len(name) - 4)) ELSE name END AS thename, CASE WHEN CHARINDEX(' ', ltrim(rtrim(Name))) 
                          = 0 THEN ltrim(Name) WHEN CHARINDEX(' ', ltrim(Name)) = 1 THEN ltrim('b') ELSE SUBSTRING(ltrim(Name) + ' x', 1, CHARINDEX(' ', ltrim(Name))) END AS subname, 
                          CHARINDEX(' ', LTRIM(Name)) AS wordcheck, Name, Address_Line_1, City AS Company_City, State AS Company_State, Zip, Area_Code, Phone, Status_Flag, ID, 
                          Not_Dupe_Flag, DUNS, Temp_Check_Dupes_Flag, Parent_Company_Number, Special_Display, 
                          CASE WHEN c.parent_company_number = 0 THEN c.id ELSE c.parent_company_number END AS parent
    FROM         dbo.Companies AS c

Then I use that view in my query to look for duplicates;

        <cfquery name="qResults" datasource="#request.dsnlive#" timeout="200">
            SELECT b.ID,
            Thename,
            substring(TheName,1,(CHARINDEX(' ',TheNAME,1))) as subName,
            name,
            b.address_line_1,
            b.zip,
            b.company_state,
            b.company_city,
            b.area_code,
            b.phone,
            b.Special_Display,
            isnull(not_dupe_flag,'False') as not_dupe_flag,
            isnull(Temp_Check_Dupes_Flag,'False') as Temp_Check_Dupes_Flag,
            b.id as bID,
            b.duns
            FROM dbo.vw_Comp_Details_withFirstWord as b
            WHERE isnull(b.status_flag,'') != 'D'
            and b.ID <> #arguments.CompNum#
                and isnull(b.Temp_Check_Dupes_Flag,'False') = 'False'
            <cfif arguments.IncludeDunsOnly eq 0>
                <cfif arguments.FirstWord>
                    AND b.subName = '#arguments.CompanySubName#'
                <cfelse>
                    AND (substring(dbo.KeepAlphaNumCharacters(Thename),1,#val(arguments.WordLength)#) = substring('#arguments.CompanyName#',1,#val(arguments.WordLength)#)
                    or differnce(soundex(Thename),soundex('#arguments.CompanyName#')) > 2)
                </cfif>
                AND (
                ( company_city = '#arguments.City#'
                 AND Isnull(company_city, '') > '' )
                 AND ( b.parent != #val(arguments.Parent)#
                     AND Isnull(b.parent, '0') > 0 )
                     )
                <cfif arguments.IncludeDuns>
                AND (
                ( REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') = '#val(arguments.Duns)#'
                 AND REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') > ' '
                 AND #val(arguments.Duns)# > 0 )
                    or  REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') = ' '
                    )
                </cfif>

            <cfelse>
                and (REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') = '#val(arguments.Duns)#')
            </cfif>


        </cfquery>

Now I need to add code to strip the suffix " Inc." but I can't seem to come up with the logic to end up with a column that contains the name without the prefix "The " and the suffix " Inc."

Leigh
  • 28,765
  • 10
  • 55
  • 103
user990016
  • 3,208
  • 4
  • 20
  • 29
  • This is a pretty common problem. (Use triggers to make sure same company isn't inserted twice.) – jarlh Sep 25 '15 at 10:22
  • Too generic. I have companies that have multiple offices and some in the same city. It's just not as easy as it sounds. – user990016 Sep 25 '15 at 10:24
  • And the database engine is? – Dan Bracuk Sep 25 '15 at 11:01
  • MS SQL 2012 and Coldfusion 9 – user990016 Sep 25 '15 at 11:06
  • There a couple of good ideas below. I've tested variations of them but can't get to a viable solution. The varied sources of my data are as diverse in their creation of company names as companies are at their legal structures. A quasi solution so far is to only compare on the first x number of characters. But we have to make multiple runs for desired results. I am still working on Juan Carlos' suggestion via MS SQL. – user990016 Sep 28 '15 at 05:09

2 Answers2

2

I will like to share my question from some days ago. This was made in Postgres but Im sure you can find an equivalent for split string into rows for your rdbms.

What you do is split the string and remove the offending string like The or Inc

SQL Fiddle Demo

| ID |   token |
|----|---------|
|  1 |     The |
|  1 |     ABC |
|  1 | Company |
|  1 |    Inc. |
|  2 |     ABC |
|  2 | Company |
|  2 |    Inc. |
|  3 |     ABC |
|  3 | Company |

Then you go the other way and join the remaining strings together postgres use string_agg() MSsql use XML PATH, etc

Community
  • 1
  • 1
Juan Carlos Oropeza
  • 47,252
  • 12
  • 78
  • 118
  • if you check my example you can do a `count(token)`+ `group by token` to check what item repeat more than once to decide which token arent relevant. In my app I have a table `descriptor` where i put the irrelevant items – Juan Carlos Oropeza Sep 28 '15 at 05:50
1

Many possible ways to do this. Consider if you want to have a fulltext index won the field which can then search for similar names and eliminate noise words like the. Or you can use an SSIS package to do fuzzy matching (this would also help with abbreviations vice spelling the whole word out). Or you can use Data Quality Services which is probably your best bet.

https://msdn.microsoft.com/en-us/library/ff877917.aspx

HLGEM
  • 94,695
  • 15
  • 113
  • 186
  • Hello HLGEM. I wonder if you have chance to see my [**Question**](http://stackoverflow.com/questions/32619987/full-text-difference-between-sql-server-and-postgres) regarding Full Text. I know is very general but looks like you have some experience there. – Juan Carlos Oropeza Sep 26 '15 at 00:47