Get Rid of Saved Duplicates from Web Scraped Data On Google Sheets

Question

The code that was given to me works by saving web scraped data in google sheets, however, when it saves data it creates duplicates of the same data entry from previously scheduled web scrapes. Is there a way we can create a code where it only saves unique rows and adds in edited/ updated info? I have added the code below:

function myFunction() {
  SpreadsheetApp.getActiveSheet().getRange('A1').setValue('=importdata("https://www.parsehub.com/api/v2/projects/tZOywrNXQ3Q4/last_ready_run/data?api_key=tn6CGEyTTVxE&format=csv")')
}
// custom menu function
function onOpen() {
  var ui = SpreadsheetApp.getUi();
  ui.createMenu('Custom Menu').addItem('Save Data','saveData').addToUi().
}
// function to save data
function saveData() { 
  var ss=SpreadsheetApp.getActiveSpreadsheet(); 
  var sh4=ss.getSheetByName('Zapier Tax Leads'); 
  var sh3=ss.getSheetByName('Current Tax Leads') 
  var data=sh3.getRange(1,1,sh3.getLastRow(),33).getValues(); 
  sh4.getRange(sh4.getLastRow()+1,1,data.length,data[0].length).setValues(data);
}
function removeDuplicates() {
var sheet = SpreadsheetApp.getActive().getSheetByName('Zapier Tax Leads')
var data = sheet.getDataRange().getValues();
var newData = new Array();
for(i in data){
var row = data[i];
var duplicate = false;
for(j in newData){
  if(row.join() == newData[j].join()){
    duplicate = true;
  }
}
if(!duplicate){
  newData.push(row);
}
}
sheet.clearContents();
sheet.getRange(1, 1, newData.length, 
newData[0].length).setValues(newData);
}

Edit:

The tab labeled "Zapier Tax Leads" is where the code copies over the duplicate data.
In tab "Current Tax Leads" the columns that we will be updating are only column H through O.
Spreadsheet link: https://docs.google.com/spreadsheets/d/1G1PLs46cnQ-CyJWBI0ibKNmYosSEQRcrAGE8Qx2MArU/edit?usp=sharing

Could you please provide a copy of the spreadsheet you're working on, free of sensitive information, clarifying the desired outcome? — Iamblichus, Jul 31 '20 at 08:21
Here is the link, the last tab labeled "Zapier Tax Leads" is where the code copies over the duplicate data. I set the permission to view only, let me know if you need to make edits: https://docs.google.com/spreadsheets/d/1G1PLs46cnQ-CyJWBI0ibKNmYosSEQRcrAGE8Qx2MArU/edit?usp=sharing — Kristan Servidad, Jul 31 '20 at 12:47
Ok, thank you for providing the link. So, if I understand you correctly, you just want to remove the duplicate rows from `Zapier Tax Leads`. Is that correct? Are the duplicate rows completely identical, or it's just certain columns that have identical values? — Iamblichus, Jul 31 '20 at 13:02
It's a mix of both in terms of the duplicates. There will be rows that are completely identical, and some that will have updated data added to the row. It adds the same set of data every hour along with any newly updated data from the tab sheets in front, this creates a TON of duplicates. The new code we need is to look for these duplicates and only save data that has been updated or added a new unique row. Thanks so much! — Kristan Servidad, Jul 31 '20 at 13:57
If there are some which are not completely identical, what criteria should be used to know which rows should be removed? What columns should be unique? And in this case, which ones of the `duplicates` should be removed and which one should be kept? — Iamblichus, Jul 31 '20 at 14:00
The duplicate should only be removed if they are exactly the same. If the row contains the same data, but there is more/ edited information added later into that row, the row needs to be either replaced with the new info or be able to insert the new info into the cell of the corresponding row. — Kristan Servidad, Jul 31 '20 at 16:02
Thank you for responding. Unfortunately, the situation is still not clarified by your last comment. You say `If the row contains the same data, but there is more/ edited information added later into that row, ...`, but how to tell apart if it's the same row with updated data, or a completely different row? There should be some column/s that identify an item, and tell the script whether it's the same row with updated data or a different row. Which columns are those? — Iamblichus, Aug 03 '20 at 12:15
I apologize for not making it clear, in tab "Current Tax Leads" the columns that we will be updating are only column H through O — Kristan Servidad, Aug 06 '20 at 16:43
Does this answer your question? [Removing Duplicate Rows in a google Spreadsheet from the end row](https://stackoverflow.com/questions/47725977/removing-duplicate-rows-in-a-google-spreadsheet-from-the-end-row) — Iamblichus, Aug 07 '20 at 07:44
It looks like this is to only remove duplicates within one specific row, can we apply this to the entire sheet? I am not an expert in coding, so I am unsure how to incorporate the changes we need to make. — Kristan Servidad, Aug 07 '20 at 13:55
To clarify, the "Current Tax Leads" tab is where we will be adding new information in columns H through O, and the "Zapier Tax Leads" is where the saved data is being stored creating duplicates. Also, there is a trigger set in place to run every hour which creates many duplicates. We need this code to delete the duplicates in "Zapier Tax Leads" — Kristan Servidad, Aug 07 '20 at 14:16

Servus · Answer 1 · 2020-08-11T18:12:26.680

1

Not the exact solution but the sheet-api's DeleteDuplicatesRequest should work.

I have not tested it, but this should work. (I presume you use the drive package for nodejs?)

var requests = {
   "requests": [{
        "deleteDuplicates": {
            //rows to operate on
            //first value is kept if there are duplicates
            "range": {
                    "sheetId": integer,
                    "startRowIndex": integer,
                    "endRowIndex": integer,
                    "startColumnIndex": integer,
                    "endColumnIndex": integer
            }
            ,
            // The columns in the range to analyze for duplicate values.
            //  If no columns are selected then all columns are
            //  analyzed for duplicates.
            "comparisonColumns": [{
                    "sheetId": integer,
                    "dimension": "COLUMNS",
                    "startIndex": integer,
                    "endIndex": integer
             }]
          }
    }]
 }

sheets.spreadsheets.values.batchUpdate(spreadsheetId=spreadsheet_id,body=requests)

edited Aug 11 '20 at 18:12

answered Aug 11 '20 at 14:40

Servus

479
3
13

I think this might be correct in fixing this issue, but how do we embed this into the current code for this to function correctly? Thanks again for the help. – Kristan Servidad Aug 11 '20 at 16:34
I added this to the end of the current code, but I got "ReferenceError: integer is not defined" – Kristan Servidad Aug 11 '20 at 19:29
I mean I can't give you the full implementation, there has to be some initiative on your side. You have to replace `integer` with the respective numbers, `sheetId` with your sheet id (probably `0`, since its the first sheet). And there will probably be some other syntax errors. This is just to outline the idea. – Servus Aug 12 '20 at 09:36
As I have just noticed, you use `SpreadsheetApp`, so try [this](https://stackoverflow.com/questions/48655614/google-scripts-delete-duplicates-from-select-range-not-whole-sheet) – Servus Aug 12 '20 at 09:47
I noticed you linked me to the SpreadsheetApp code, so I used that one instead of the one you gave me and added it to the bottom of the original code. It looks like it runs but it's still not removing the duplicates. The duplicates need to be removed from spreadsheet 4. I have updated the code above with your suggestions, any advice? – Kristan Servidad Aug 13 '20 at 13:38
I assume you mean Sheet 4 not Spreadsheet 4? Then you can replace `var sheet = SpreadsheetApp.getActiveSheet();` from the link above to `var sheet = SpreadsheetApp.getActive().getSheetByName('YourSheetName')` – Servus Aug 13 '20 at 16:02
Hmm still not working, I edited the code above to reflect the changes we made. Any other suggestions? – Kristan Servidad Aug 13 '20 at 16:25
No sorry. If your programm does not have to be efficient, you could brute force delete duplicates, reading every line and comparing them... Although I am not sure why the suggestion from the other stack overflow question does not work in your case? – Servus Aug 14 '20 at 18:22

Get Rid of Saved Duplicates from Web Scraped Data On Google Sheets

Edit:

1 Answers1