There are a couple of things that are making your code slow. Let's look at your two for loops:
for (i in data) {
  var row = data[i];
  var duplicate = false;
  for (j in newData) {
    if (row.join() == newData[j].join()) {
      duplicate = true;
    }
  }
  if (!duplicate) {
    newData.push(row);
  }
}
On the face of it, you're doing the right things: For every row in the original data, check if the new data already has a matching row. If it doesn't, add the row to the new data. In the process, however, you're doing a lot of extra work.
Consider, for example, the fact that at any given time, a row in data will have no more than one matching row in newData. But in your inner for loop, after you find that one match, it still continues checking the rest of the rows in newData. The solution to this would be to add a break; after duplicate = true; to stop iterating.
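With that change, the inner loop would look something like this:

for (j in newData) {
  if (row.join() == newData[j].join()) {
    duplicate = true;
    break; // a match was found, so there's no need to check the remaining rows
  }
}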
Consider also that for any given j, the value of newData[j].join() will always be the same. Suppose you have 100 rows in data, and no duplicates (the worst case). By the time your function finishes, you'll have calculated newData[0].join() 99 times, newData[1].join() 98 times... all in all you'll have done almost 5,000 calculations to get the same 99 values. A solution to this is memoization, whereby you store the result of a calculation in order to avoid doing the same calculation again later.
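One way to apply that here, keeping your overall structure, is to cache each row's joined string in a parallel array (newKeys below, a name I'm introducing just for illustration) at the moment the row is added to newData, so each row gets joined only once:

var newKeys = []; // newKeys[j] caches newData[j].join()
for (i in data) {
  var row = data[i];
  var key = row.join(); // join this row exactly once
  var duplicate = false;
  for (j in newKeys) {
    if (key == newKeys[j]) {
      duplicate = true;
      break;
    }
  }
  if (!duplicate) {
    newData.push(row);
    newKeys.push(key);
  }
}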
Even if you make those two changes, though, your code's time complexity is still O(n²). If you have 100 rows of data, in the worst case the inner loop will run 4,950 times. For 10,000 rows that number is around 50 million.
However, we can do this in O(n) time instead, if we get rid of the inner loop and reformulate the outer loop like so:
var seen = {};
for (var i in data) {
  var row = data[i];
  var key = row.join();
  if (key in seen) {
    continue;
  }
  seen[key] = true;
  newData.push(row);
}
Here, instead of checking every row of newData for a row matching row in every iteration, we store every row we've seen so far as a key in the object seen. Then in each iteration we just have to check whether seen has a key matching row, an operation we can do in nearly constant time, or O(1).
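As an aside, if your script runs on the newer V8 runtime, the same idea can be written with a Set, which exists specifically for this kind of membership test. This is just an alternative sketch of the same approach, not something the solution depends on:

const seen = new Set();
for (const row of data) {
  const key = row.join();
  if (seen.has(key)) continue; // an identical row was already kept
  seen.add(key);
  newData.push(row);
}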
As a complete function, here's what it looks like:
function removeDuplicates_() {
  const startTime = new Date();
  const sheet = SpreadsheetApp.getActiveSheet();
  const data = sheet.getDataRange().getValues();
  const numRows = data.length;
  const newData = [];
  const seen = {};
  for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
    key = JSON.stringify(row);
    if (key in seen) {
      continue;
    }
    seen[key] = true;
    newData.push(row);
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
  // Show summary
  const secs = (new Date() - startTime) / 1000;
  SpreadsheetApp.getActiveSpreadsheet().toast(
      Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
          numRows, secs, numRows / secs, numRows - newData.length),
      'Remove duplicates', -1);
}

function onOpen() {
  SpreadsheetApp.getActive().addMenu('Scripts', [
    { name: 'Remove duplicates', functionName: 'removeDuplicates_' }
  ]);
}
You'll see that instead of using row.join(), this code uses JSON.stringify(row), because row.join() is fragile (['a,b', 'c'].join() == ['a', 'b,c'].join(), for example). JSON.stringify isn't free, but it's a good compromise for our purposes.
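To make that fragility concrete, here is what the two approaches produce for those two distinct rows:

['a,b', 'c'].join();           // 'a,b,c'
['a', 'b,c'].join();           // 'a,b,c'  -- different rows, same key
JSON.stringify(['a,b', 'c']);  // '["a,b","c"]'
JSON.stringify(['a', 'b,c']);  // '["a","b,c"]'  -- distinct keys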
In my tests this processes a simple spreadsheet with 50,000 rows and 2 columns in a little over 8 seconds, or around 6,000 rows per second.