
I manage Google Sheets lists that sometimes exceed 10,000 rows. For sheets with up to around 5,000 rows, the remove-duplicates function below works fine. But for anything above that, I get an 'Exceeded maximum execution time' error. I would be grateful for some guidance on how to make the code efficient enough to run smoothly even on sheets with 10k+ rows.

function removeDuplicates() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  var newData = new Array();
  for(i in data){
    var row = data[i];
    var duplicate = false;
    for(j in newData){
      if(row.join() == newData[j].join()){
        duplicate = true;
      }
    }
    if(!duplicate){
      newData.push(row);
    }
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
}
Ed Dev

1 Answer


There are a couple of things that are making your code slow. Let's look at your two for loops:

for (i in data) {
  var row = data[i];
  var duplicate = false;

  for (j in newData){
    if (row.join() == newData[j].join()) {
      duplicate = true;
    }
  }

  if (!duplicate) {
    newData.push(row);
  }
}

On the face of it, you're doing the right things: For every row in the original data, check if the new data already has a matching row. If it doesn't, add the row to the new data. In the process, however, you're doing a lot of extra work.

Consider, for example, the fact that at any given time, a row in data will have no more than one matching row in newData. But in your inner for loop, after you find that one match, it still continues checking the rest of the rows in newData. The solution to this would be to add a break; after duplicate = true; to stop iterating.
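
As a rough sketch, that first fix on its own keeps your loop structure and just stops scanning early:

for (var i in data) {
  var row = data[i];
  var duplicate = false;

  for (var j in newData) {
    if (row.join() == newData[j].join()) {
      duplicate = true;
      break; // stop scanning newData as soon as a match is found
    }
  }

  if (!duplicate) {
    newData.push(row);
  }
}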

Consider also that for any given j, the value of newData[j].join() will always be the same. Suppose you have 100 rows in data, and no duplicates (the worst case). By the time your function finishes, you'll have calculated newData[0].join() 99 times, newData[1].join() 98 times... all in all you'll have done almost 5,000 calculations to get the same 99 values. A solution to this is memoization, whereby you store the result of a calculation in order to avoid doing the same calculation again later.
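
A sketch of that memoization idea, still with the nested loops (the newKeys array is my addition; it caches each row's join() result alongside newData so it's only computed once):

var newData = [];
var newKeys = []; // newKeys[j] always holds newData[j].join(), computed once

for (var i in data) {
  var row = data[i];
  var key = row.join(); // one join() per row of data
  var duplicate = false;

  for (var j in newKeys) {
    if (key == newKeys[j]) {
      duplicate = true;
      break;
    }
  }

  if (!duplicate) {
    newData.push(row);
    newKeys.push(key);
  }
}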

Even if you make those two changes, though, your code's time complexity is still O(n²). If you have 100 rows of data, in the worst case the inner loop will run 4,950 times. For 10,000 rows that number is around 50 million.
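
Those figures come from the triangular-number formula: with n unique rows, the inner loop runs 0 + 1 + ... + (n - 1) = n * (n - 1) / 2 times in the worst case. A quick check:

// Worst-case inner-loop iterations for n unique rows
function worstCaseComparisons(n) {
  return n * (n - 1) / 2;
}

Logger.log(worstCaseComparisons(100));   // 4950
Logger.log(worstCaseComparisons(10000)); // 49995000, roughly 50 million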

However, we can do this in O(n) time instead if we get rid of the inner loop and reformulate the outer loop like so:

var seen = {};

for (var i in data) {
  var row = data[i];
  var key = row.join();

  if (key in seen) {
    continue;
  }
  seen[key] = true;
  newData.push(row);
}

Here, instead of scanning every row of newData for a match on every iteration, we store each row we've seen so far as a key in the object seen. Then on each iteration we only have to check whether seen already has that row's key, an operation we can do in nearly constant time, or O(1).

As a complete function, here's what it looks like:

function removeDuplicates_() {
  const startTime = new Date();
  const sheet = SpreadsheetApp.getActiveSheet();
  const data = sheet.getDataRange().getValues();
  const numRows = data.length;
  const newData = [];
  const seen = {};

  for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
    key = JSON.stringify(row);
    if (key in seen) {
      continue;
    }
    seen[key] = true;
    newData.push(row);
  }

  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);

  // Show summary
  const secs = (new Date() - startTime) / 1000;
  SpreadsheetApp.getActiveSpreadsheet().toast(
    Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
                           numRows, secs, numRows / secs, numRows - newData.length),
    'Remove duplicates', -1);
}

function onOpen() {
  SpreadsheetApp.getActive().addMenu('Scripts', [
    { name: 'Remove duplicates', functionName: 'removeDuplicates_' }
  ]);
}

You'll see that instead of using row.join() this code uses JSON.stringify(row), because row.join() is fragile (['a,b', 'c'].join() == ['a', 'b,c'].join(), for example). JSON.stringify isn't free, but it's a good compromise for our purposes.
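
A quick way to see that fragility for yourself (these two lines aren't part of the function, just a demonstration you can run in the script editor):

// join() collapses different rows into the same key...
Logger.log(['a,b', 'c'].join() == ['a', 'b,c'].join());                   // true: a false duplicate
// ...while JSON.stringify preserves the cell boundaries
Logger.log(JSON.stringify(['a,b', 'c']) == JSON.stringify(['a', 'b,c'])); // false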

In my tests this processes a simple spreadsheet with 50,000 rows and 2 columns in a little over 8 seconds, or around 6,000 rows per second.

Jordan Running
  • Note that "key-in-object" (key in seen) takes O(log(N)) time, not O(1). So the whole algorithm takes O(N * log(N)), not O(N). If we could do that in constant time, that would be quite a revolution :D – Ivan Kuckir Jan 24 '18 at 22:48
  • @IvanKuckir Source? – Jordan Running Jan 24 '18 at 22:53
  • @JordanRunning You should show the source that it is O(1). If you ever studied computer science, it should be obvious that this is not possible. Sure, there exist perfect hashing functions and amortized structures, but they are still not O(1). – Ivan Kuckir Jan 24 '18 at 23:06
  • @JordanRunning it is a hashmap (Chrome can only optimize repeated patterns, which this case clearly is not). It will take O(1) on a good day, slightly slower realistically, but still not as bad as O(log(n)) (the theoretical worst case is O(n)). See https://stackoverflow.com/a/15469844/ – Greenbeard Jan 24 '18 at 23:10
  • @JordanRunning Seems like V8 uses [Splay Trees](https://github.com/v8/v8/blob/c076667b7f9fefd793547b1e9bbbde7dc47dd76f/src/splay-tree.h) for this. I am quite surprised, because Red-Black trees are usually used for this (e.g. in .Net or for std:: structures in C++). Anyway, all of them are log(N) structures. – Ivan Kuckir Jan 24 '18 at 23:28
  • @IvanKuckir [We’ve already had this discussion](https://stackoverflow.com/questions/2771368/can-hash-tables-really-be-o1). If you want to keep rehashing it (ha!) I suggest doing it in chat. – Jordan Running Jan 24 '18 at 23:33
  • @JordanRunning how do I start a chat? Your O(1) is still wrong. This has nothing to do with hashing. Even hashing is not O(1). You should not trust "scientists" on StackOverflow. – Ivan Kuckir Jan 24 '18 at 23:54
  • @JordanRunning, many thanks. That was a very helpful explanation. – Ed Dev Jan 25 '18 at 11:46
  • @JordanRunning, I wonder whether you might be so kind as to also provide instruction on a related question I asked yesterday about making a 'remove keywords' function more efficient. – Ed Dev Jan 26 '18 at 12:53