Ways to Clean-up messy records in sql

Question

I have the following sql data:

ID              Company Name        Customer            Address 1       City                        State  Zip   Date
0108500         AAA Test            Mish~Sara           Newa Claims     Chtiana                     CO     123   06FE0046        
0108500         AAA.Test            Mish~Sara           Newa Claims     Chtiana                     CO     123   06FE0046        
1802600         AAA Test Company    Ban, Adj.~Gorge     PO Box 83       MouLaurel                      CA      153   09JS0025        
1210600         AAA Test Company    Biwel~Brce          97kehst ve      Jacn                        CA     153   04JS0190

AAA Test, AAA.Test and AAA Test Company are considered as one company.

Since their data is messy I'm thinking either to do this:

Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name?

In this case, the AAA Test and AAA.Test will be AAA Test Company.

OR Is there a way to filter only record with company name that are almost the same then they can have option to change it?

If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.

Thank you in advance.

FuzzyTree is right - You'll have to come up with set rules for what you're calling "almost the same" and tell that to SQL. From there you could join the table on itself using those rules and identify rows, but you'll probably want to review all of them because it would be risky to blindly update from there. — the-nick-wilson, Apr 22 '15 at 14:59
rule 1: don't do this in a live production database. rule 2: be extremely careful with your customers' data. rule 3: don't delete stuff. just archive it off elsewhere or put an expiry date on it — roblogic, Apr 22 '15 at 15:02
Which dbms are you using? Some product specific functionality may be useful here! — jarlh, Apr 22 '15 at 15:08
@ropata, I suppose "Is there a way to search all the records" just means how to find suspected duplicates - not to delete them without any consideration. — jarlh, Apr 22 '15 at 15:10
Another approach would be to do the clean-up in two stages. First identify good-enough matches that can be 'cleaned' immediately. Secondly identify any remaining partial matches to check and then clean up manually. — roblogic, Apr 22 '15 at 16:19
I think this question should really be split into two parts: 1 - Strategically, what is the best way to approach this kind of issue (which is what some people have answered in these comments), and 2 - How, technically, would you do it, once you decide on a strategy. It's a very large scoped question right now. — the-nick-wilson, Apr 22 '15 at 16:35
Question is a likely duplicate of [Fuzzy matching using T-SQL](http://stackoverflow.com/questions/921978/fuzzy-matching-using-t-sql) — roblogic, Apr 22 '15 at 17:53

jarlh · Answer 1 · 2015-04-22T15:01:57.197

2

Something like a self join? || is ANSI SQL concat, some products have a concat function instead.

select *
from tablename t1
  join tablename t2 on t1.companyname like '%' || t2.companyname || '%'

Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.

And, as Miguel suggests, use REPLACE to remove commas and dots etc.

Use case-insensitive collation. SOUNDEX can be used etc etc.

edited Apr 22 '15 at 15:01

answered Apr 22 '15 at 14:55

jarlh

42,561
8
45
63

I would also include a replace statement in there to remove '.'s and other items that might appear. REPLACE(stringColumnName, '.', '') – Miguel Apr 22 '15 at 14:59
1

Good idea to convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). I don't think a concat will help though. – roblogic Apr 22 '15 at 16:09

score 2 · Accepted Answer · answered Apr 22 '15 at 15:01

2

You could use String matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate People's names that have been typed in differently. It can take awhile but it does work well for the fuzzy match you're looking for.

answered Apr 22 '15 at 15:01

jradich1234

1,410
5
24
29

what is the sql version for it? – newbie Apr 22 '15 at 15:03
The [algorithm](http://stackoverflow.com/questions/27192215/what-is-the-best-algorithm-for-matching-two-string-containing-less-than-10-words) is in the [SimMetrics](http://sourceforge.net/projects/simmetrics/) library (java) but someone has built a .net version for [SqlServer](http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=158143). Oracle databases also have "[Oracle Text](http://www.orafaq.com/forum/t/188508/)" for this kind of task – roblogic Apr 22 '15 at 18:21

score 1 · Answer 3 · answered Apr 22 '15 at 15:05

I think most Database Servers support Full-Text search ability, and if so there are some functions related to Full-Text search that support Proximity. for example there is a Near function in SqlServer and here is its documentation https://msdn.microsoft.com/en-us/library/ms142568.aspx

roblogic · Answer 4 · 2015-04-23T18:16:42.100

You can do the clean-up in several stages.

Create new columns

Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows

/** Populate dupstr column for fuzzy match **/
update vendor v           
set v.dupstr = left(upper(regex_replace(regex_replace(v.companyname,'.',''),' ','')),6)
;

Identify duplicate records

Add an index on the dupstr column, then do an update like this to identify "good" records:

/** Mark the good duplicates **/
update vendor v           
set v.status = 'keep'  --indicate keeper record
where 
  --dupes to clean up
  exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr 
           and v.id != v1.id )
and 
  ( --keeper has longest name
    length(v.companyname) = 
      ( select max(length(v2.companyname)) from vendor v2
        where v.dupstr = v2.dupstr 
      )
    or
    --keeper has latest record (assuming ID is sequential)
    v.id =
      ( select max(v3.id) from vendor v3
        where v.dupstr = v3.dupstr 
      )
  )
group by v.dupstr
;

The above SQL can be refined to add "dupe" status to other records , or you can do a separate update.

Clean Up Stragglers

Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)

PS: the above "SQL" is a completely untested pseudo-code example. — roblogic, Apr 22 '15 at 22:26

score 1 · Answer 5 · answered Nov 08 '18 at 10:20

You can use SQL query with SOUDEX of DIFFRENCE

For example: SELECT DIFFERENCE ('AAA Test','AAA Test Company')

DIFFERENCE returns 0 - 4 ( 4 = almost the same, 0 - totally diffrent)

See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017

Ways to Clean-up messy records in sql

5 Answers5

Create new columns

Identify duplicate records

Clean Up Stragglers