1

Test File

Sometimes, my lists of emails include duplicate addresses for the same person. For example, Jane's addresses are both "jane.doe@email.com" and "doe.jane@email". Her variants include replacing the "." with "-" or "_". At the moment, my duplicates script—upgraded ever so kindly by @Jordan Running and Ed Nelson—takes care of 'strict' duplicates, yet cannot detect that "doe.jane@email.com" is a 'complicated' duplicate of "jane.doe@email.com". Is there a way to delete even these duplicates such that I do not email more than one of Jane's addresses? All of them point to the same inbox, so I need only include one of her addresses.

Here is my current code:

function removeDuplicates() {
  const startTime = new Date();
  const newData = [];
  const sheet = SpreadsheetApp.getActiveSheet();
  const data = sheet.getDataRange().getValues();
  const numRows = data.length;
  const seen = {};

  for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
    key = JSON.stringify(row);
    if (key in seen) {
      continue;
    }
    seen[key] = true;
    newData.push(row);
  };

  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);

  // Show summary
  const secs = (new Date() - startTime) / 1000;
  SpreadsheetApp.getActiveSpreadsheet().toast(
    Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
                           numRows, secs, numRows / secs, numRows - newData.length),
    'Remove duplicates', -1);
}
Max Makhrov
  • 17,309
  • 5
  • 55
  • 81
Ed Dev
  • 105
  • 7
  • 2
    Making these relationships will be very specific to each user and likely near impossible, consider `jane.doe@gmail.com` and `jdoe@gmail.com`. I would suggest when collecting emails, having a primary / secondary email. – Phillip Thomas Jan 31 '18 at 16:00
  • 2
    Yes, agree with Phillip Thomas - I have some experience with this and it becomes very difficult to eliminate potential duplicates with any certainty (in Phillip Thomas example is jdoe.gmail.com jane doe or it is joe doe?) In the end the best I came up with is to run a function which alerted potential duplicates and allowed a human to make the final decision. – Tony Duffill Jan 31 '18 at 16:12
  • Noted with thanks, @PhillipThomas, yet is there a way to still catch just the jane.doe@email.com and doe.jane@email.com variants? – Ed Dev Jan 31 '18 at 20:50
  • Not really: the point is that you cannot know if `jane.doe@email.com` and `doe.jane@email.com` are in fact the same person. – msanford Feb 02 '18 at 13:39
  • @msanford, thanks for your note. I assume as much since in 99% of cases in my sheets, those addresses direct to the same inbox. – Ed Dev Feb 02 '18 at 13:49

2 Answers2

1

Sample File

Fuzzy match test

enter image description here

Notes:

  • used without @email.com part, it distorts the result
  • use a the custom function: =removeDuplicatesFuzzy(B2:B12,0.66)
  • 0.66 is a percentage of fuzzy match.
  • the right column of a result (Column D) shows found matches with > 0.66 accuracies. Dash - is when matches are not found ("unique" values)

Background

You may try this library: https://github.com/Glench/fuzzyset.js

To install it, copy the code from here.

The usage is simple:

function similar_test(string1, string2)
{
  string1 = string1 || 'jane.doe@email.com';
  string2 = string2 || 'doe.jane@email.com'
  a = FuzzySet();
  a.add(string1);
  var result = a.get(string2);
  
  Logger.log(result);  // [[0.6666666666666667, jane.doe@email.com]]
  
  return result[0][0]; // 0.6666666666666667
}

There's also more info here: https://glench.github.io/fuzzyset.js/

Notes:

  • please google more info, look for javascript fuzzy string match. Here's related Q: Javascript fuzzy search that makes sense. Note: the solution should work in Google Sheets (no ECMA-6)
  • this algorithm is not smart like a human, it tests a string by char. If you have two similar strings like don.jeans@email.com it will be 84% similar to doe.jane@email.com but human detects it is completely another person.
Community
  • 1
  • 1
Max Makhrov
  • 17,309
  • 5
  • 55
  • 81
  • Thanks, @Max. I'm not sure I understand. I copied the code from the link and then pasted it into my test sheet but I am receiving errors I've never seen before. My test sheet is here: https://docs.google.com/spreadsheets/d/1RQxZ2yMeARdjGAcoJfx3Mtgdb7IM35WUICk7I-Q5AuU/edit?usp=sharing – Ed Dev Feb 01 '18 at 19:04
  • 1
    Hi, @ed-dev, thank you for the test file. I've made some tests and still the best result I have is far from your goal. Please try my sample, if it helps you. I can't find a better solution so far. But I like your question and want to propose bounty for it, to see the better approaches. – Max Makhrov Feb 02 '18 at 13:09
  • Thank you, @Max Makhrov. I'm not able to view the script code with your sample file given the permissions. (FYI, the link to 'Test Fuzzy Match' in your answer differs from the 'Test sheet 2' link in my post, which might cause confusion if you set a bounty.) – Ed Dev Feb 02 '18 at 13:47
  • for a start, I think it would even be an improvement to remove less complicated duplicates such as john-doe@email.com whenever there is a john.doe@email.com. – Ed Dev Feb 02 '18 at 18:21
  • Hi @Max. I wonder whether you've had any more thoughts on this question. – Ed Dev Feb 04 '18 at 10:41
0

Search for my Google Sheets add-on called Flookup. It should do what you want.

For your case, you can use this function:

ULIST(colArray, [threshold])

The parameter details are:

  1. colArray: the column from which unique values are to be returned.
  2. threshold: the minimum percentage similarity between the colArray values that are not unique.

Or you can simply use the Highlight duplicates or Remove duplicates from the add-on menu. The key feature is that you can adjust the level of strictness by changing the percentage similarity.

Bonus: It will easily catch swaps like jane.doe@email.com / doe.jane@email.com

You can find out more at the official website.