List of items find almost duplicates

Question

Within excel I have a list of artists, songs, edition. This list contains over 15000 records. The problem is the list does contain some "duplicate" records. I say "duplicate" as they aren't a complete match. Some might have a few typo's and I'd like to fix this up and remove those records.

So for example some records:

ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official

Each dash indicates a separate column (so 3 columns A, B, C are filled in)

How would I mark them as duplicates within Excel?

I've found out about the tool Fuzzy Lookup. Yet I'm working on a mac and since it's not available on mac I'm stuck.

Any regex magic or vba script what can help me out? It'd also be alright to see how much similar the row is (say 80% similar).

This is the only fuzzy matching code I've found to be useful: http://www.mrexcel.com/forum/excel-questions/195635-fuzzy-matching-new-version-plus-explanation.html — Darren Bartrup-Cook, Aug 19 '16 at 13:28

score 1 · Accepted Answer · edited May 23 '17 at 10:27

One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:

https://stackoverflow.com/a/4243652/1278553

From there, you can use the function directly in your spreadsheet to find similarities between instances:

You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:

select
  s1.group, s2.group, s1.song, s2.song,
  levenshtein (s1.group, s2.group) as group_match,
  levenshtein (s1.song, s2.song) as song_match
from
  songs s1
  cross join songs s2
order by
  group_match, song_match

Yes, this would be a very costly query, depending on the number of records (in your example 225,000,000 rows), but it would bubble to the top the most likely duplicates / matches. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limit it to cases where the group matches, nearly matches, begins with the same letter, etc, or pre-filtering out groups where the Levenschtein is greater than x.

Nathan_Sav · Answer 2 · 2016-08-19T13:48:23.373

0

You could use an array formula, to indicate the duplicates, and you could modify the below to show the row numbers, this checks the rows beneath the entry for any possible 80% dupes, where 80% is taken as left to right, not total comparison. My data is a1:a15000

=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)

This way will also look back up the list, to indicate the ones found

=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A1)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))

The first entry i.e. row 1 is the first part of the formula, and the last row will need the last part after the +

edited Aug 19 '16 at 13:48

answered Aug 19 '16 at 13:25

Nathan_Sav

8,466
2
13
20

How would i use this in a list full of data where the data is not sorted and the similar one doesn't always have to be below the other one..? – Baklap4 Aug 19 '16 at 13:27
It would be detected in the one above, so ABC123 before ABC999 will be ABC will show as dupe next to 123, if the other way it will show by 999 one dupe will always be above. – Nathan_Sav Aug 19 '16 at 13:28

score -1 · Answer 3 · answered Aug 19 '16 at 13:19

-1

try this worksheet fucntions in your loop:

=COUNTIF(Range,"*yourtexttofind*")

answered Aug 19 '16 at 13:19

Doug Coats

6,255
9
27
49

Sorry but this won't do the trick. This function returns the amount of items found with the specific text. So if i look for "ABBA - Mama Mia!" it'll just return 1. It won't detect the other row and treats it as something completely different. What i need is to find an approximate match. And if it is an approximate match mark it duplicate. – Baklap4 Aug 19 '16 at 13:25
Using the *'s will add wild card's – Nathan_Sav Aug 19 '16 at 13:27
Actually it will work, i use it to find approximate values in one of my at work tasks. I normally need to find "Fail" across a 15000 count spreadsheet. Problem is that a lot of the time is "Fail-Something Else" or its "
some text Fail some more text
. SO yeah it works fine for me :P – Doug Coats Aug 19 '16 at 13:31
yes, this is the easiest route, just use something like this `countif(range, "*" & mid(a1,1,int(len(a1)*0.8))) & "*")` – Nathan_Sav Aug 19 '16 at 13:47

List of items find almost duplicates

3 Answers3