Removing "almost duplicates" using SAS or Excel

Question

I am working in SAS and I have a data-set with 2 columns and I want not only to remove the duplicates, but also the "almost" duplicates. The data looks like this:

**Brand        Product**
Coca Cola    Coca Cola Light
Coca Cola    Coca Cola Lgt
Coca Cola    Cocacolalight
Coca Cola    Coca Cola Vanila
  Pepsi       Pepsi Zero
  Pepsi       Pepsi Zro

i do not know if it is actually possible, but what I would like the file to look like after removing the "duplicates", is like that:

    **Brand        Product**
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Vanila
      Pepsi       Pepsi Zero

I don't have a preference if the final table will have e.g. "Pepsi Zero" or "Pepsi Zro" as long as there are no "duplicate" values.

I was thinking if there was a way to compare the e.g. first 4-5 letters and if they are the same then to consider them as duplicates. But of course I am open to suggestions. If there is a way to be done even in excel I would be interested to hear it.

A note: there is an entire field around this sort of process, and it's very difficult to do effectively. — Joe, Mar 31 '15 at 17:35

score 4 · Accepted Answer · edited May 23 '17 at 10:24

I'm going to start by straight up quoting Jeff's answer :

SAS has at least a couple functions for calculating edit distance between two strings:

Compged, for general edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm

Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm

There's also the spedis() function for comparing edit distances.

Now those are all great, but my personal favorite is the soundex() function which will allow you to test if two words 'sound' the same. It's not going to be 100% correct but in this case the results work alright.

First some data:

Data HAVE;
  attrib name length=$20 alt_name length=$20;
  infile datalines dsd dlm=',' truncover;
  input name $ alt_name $;
  datalines;
Coca Cola    ,Coca Cola Light
Coca Cola    ,Coca Cola Lgt
Coca Cola    ,Cocacolalight
Coca Cola    ,Coca Cola Vanila
Pepsi        ,Pepsi Zero
Pepsi        ,Pepsi Zro
;
Run;

Get every combination of words that we want to compare, and calculate the soundex()s for eyeballing:

proc sql noprint;
  create table cartesian as
  select a.name,
         a.alt_name as alt_name1,
         b.alt_name as alt_name2,
         soundex(a.alt_name) as soundex_a,
         soundex(b.alt_name) as soundex_b
  from have a, have b
  where a.name = b.name
    and soundex(a.alt_name) eq soundex(b.alt_name)
  ;
quit;

Now I'll leave it up to use as an exercise to dedupe the resulting list. But basically this will tell you which words match up. If you get false-positives for the matches, just add them to an exception list to manually transform those particular values.

I should also add, that if you have a LOT of data, you may want to be careful with that SQL statement as it is performing a cross join (also known as a cartesian product join) and can produce a LOT of data. You may be better off getting the distinct list of products first then doing the cross join against that. — Robert Penridge, Mar 31 '15 at 16:33
The non-sql method for this is basically to create a new variable that actually stores SOUNDEX, then sort by that. Then you can collapse it in the datastep fairly similarly to the SQL, except without a cartesian product. I would note that this doesn't really work all that well, but what can you do. — Joe, Mar 31 '15 at 17:35
@RobertPenridge thank you very much for your answer. It was very interesting since I did not know any of the functions you mentioned, and I was also able to use the SQL method because my data were not that big. A small question though regarding the `soundex()` function. When used to compare e.g. `Coca Cola Light` and `Coca Cola Lgt` the values for those 2 were something like `A1233582247` and `A12335822`. So it was not possible to compare the values of the `soundex()` values and say e.g. if `-5 — Noob_Strider, Apr 07 '15 at 11:37
Hmm, I don't think I would use logic like that as I think the numbers are created by concatenating a number assigned to each phonetic sound. In this case, there's an additional sound on the end of the number so that logic would work, but if the difference was at the beginning it wouldn't work. A better option may be to use one of the edit distance functions in combination with the soundex function. Ie. if the soundex is the same OR the edit distance is less than x. — Robert Penridge, Apr 07 '15 at 15:42

Removing "almost duplicates" using SAS or Excel

1 Answers1

Linked