To exceed the ImportXML limit on Google Spreadsheet

Question

I am stucking on a "scraping problem" right now. Especially i want to extract the name of the author from a webpage to google spreadsheet. Actually the function =IMPORTXML(A2,"//span[@class='author vcard meta-item']") is working, but after i raise the amount of links to scrape it just starts to load endless.

So i researched and find out, that this problem is due to the fact, that there is a limit of google.

Does anybody know of to exceed the limit or a script, which i could "easily copy" ? - i really do not have a hunch of coding.

score 26 · Answer 1 · edited Sep 02 '23 at 17:50

I created a custom import function that overcomes all limits of IMPORTXML I have a sheet using this in about 800 cells and it works great.

It makes use of Google Sheet’s custom scripts (Extensions -> Apps Script.…) and searches through content using regex instead of xpath.

function importRegex(url, regexInput) {
  var output = '';
  var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (fetchedUrl) {
    var html = fetchedUrl.getContentText();
    if (html.length && regexInput.length) {
      output = html.match(new RegExp(regexInput, 'i'))[1];
    }
  }
  // Grace period to not overload
  Utilities.sleep(1000);
  return output;
}

You can then use this function like any function.

=importRegex("https://example.com", "<title>(.*)<\/title>")

Of course, you can also reference cells.

=importRegex(A2, "<title>(.*)<\/title>")

If you don’t want to see HTML entities in the output, you can use this function.

var htmlEntities = {
  nbsp:  ' ',
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '–',
  ndash: '-',
  quot:  '"',
  amp:   '&',
  apos:  '\''
};
 
function unescapeHTML(str) {
    return str.replace(/\&([^;]+);/g, function (entity, entityCode) {
        var match;
 
        if (entityCode in htmlEntities) {
            return htmlEntities[entityCode];
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            return String.fromCharCode(parseInt(match[1], 16));
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            return String.fromCharCode(~~match[1]);
        } else {
            return entity;
        }
    });
};

All together…

function importRegex(url, regexInput) {
  var output = '';
  var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (fetchedUrl) {
    var html = fetchedUrl.getContentText();
    if (html.length && regexInput.length) {
      output = html.match(new RegExp(regexInput, 'i'))[1];
    }
  }
  // Grace period to not overload
  Utilities.sleep(1000);
  return unescapeHTML(output);
}
 
var htmlEntities = {
  nbsp:  ' ',
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '–',
  ndash: '-',
  quot:  '"',
  amp:   '&',
  apos:  '\''
};
 
function unescapeHTML(str) {
    return str.replace(/\&([^;]+);/g, function (entity, entityCode) {
        var match;
 
        if (entityCode in htmlEntities) {
            return htmlEntities[entityCode];
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            return String.fromCharCode(parseInt(match[1], 16));
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            return String.fromCharCode(~~match[1]);
        } else {
            return entity;
        }
    });
};

Let's say want to get 2.49% APR on this page: https://actorsfcu.com/loans/auto with this function, how would I get it? I tried the regex but couldn't achieve it. — sojim, Sep 08 '18 at 22:50
How would we pass through quotes in the parameter of the regex with your function? For example `"(.*?%\s+apr)` — sojim, Sep 09 '18 at 22:48
I am unsure what you mean by "pass through quotes." Can you open a new question? — Josh Bradley, Sep 10 '18 at 17:12
Kudos for this great function... Following the example before if `A1="apr"` in theory it'd be possible to something like `>(.*?%\s+"&A1&")` but this function seems not correctly formatted and returns an error... Do you know if it's possible to pass a cell value within the regex portion? — MrSlash, Sep 09 '19 at 07:59
@MrSlash Unfortunately I can't get that to work either. Let me know if you find a fix. — Josh Bradley, Sep 10 '19 at 02:41
Code needs to be entered at Extensions -> Apps Script. I've tried to edit the answer, but I wasn't able to do that. — KWriter, Sep 30 '22 at 07:55

score 0 · Answer 2 · answered Aug 18 '16 at 11:30

0

There is no such script to exceed the limits. Since the code is run on a Google machine (server) you can not cheat. Some limits are bind to your spreadsheet, so you could try to use multiple spreadsheets, if that helps.

answered Aug 18 '16 at 11:30

michaelsinner

376
1
2
10

thanks, thats good idea, but the problem ist, that i dont know the exact limit to divide the data into different spreadsheets. The importxml function also needs a very long time until it can extract the ?span class? i am looking for. – rookie4 Aug 19 '16 at 08:41

To exceed the ImportXML limit on Google Spreadsheet

2 Answers2

Linked

Related