10

I am stucking on a "scraping problem" right now. Especially i want to extract the name of the author from a webpage to google spreadsheet. Actually the function =IMPORTXML(A2,"//span[@class='author vcard meta-item']") is working, but after i raise the amount of links to scrape it just starts to load endless.

So i researched and find out, that this problem is due to the fact, that there is a limit of google.

Does anybody know of to exceed the limit or a script, which i could "easily copy" ? - i really do not have a hunch of coding.

Rubén
  • 34,714
  • 9
  • 70
  • 166
rookie4
  • 111
  • 1
  • 1
  • 4

2 Answers2

26

I created a custom import function that overcomes all limits of IMPORTXML I have a sheet using this in about 800 cells and it works great.

It makes use of Google Sheet’s custom scripts (Extensions -> Apps Script.…) and searches through content using regex instead of xpath.

function importRegex(url, regexInput) {
  var output = '';
  var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (fetchedUrl) {
    var html = fetchedUrl.getContentText();
    if (html.length && regexInput.length) {
      output = html.match(new RegExp(regexInput, 'i'))[1];
    }
  }
  // Grace period to not overload
  Utilities.sleep(1000);
  return output;
}

You can then use this function like any function.

=importRegex("https://example.com", "<title>(.*)<\/title>")

Of course, you can also reference cells.

=importRegex(A2, "<title>(.*)<\/title>")

If you don’t want to see HTML entities in the output, you can use this function.

var htmlEntities = {
  nbsp:  ' ',
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '–',
  ndash: '-',
  quot:  '"',
  amp:   '&',
  apos:  '\''
};
 
function unescapeHTML(str) {
    return str.replace(/\&([^;]+);/g, function (entity, entityCode) {
        var match;
 
        if (entityCode in htmlEntities) {
            return htmlEntities[entityCode];
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            return String.fromCharCode(parseInt(match[1], 16));
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            return String.fromCharCode(~~match[1]);
        } else {
            return entity;
        }
    });
};

All together…

function importRegex(url, regexInput) {
  var output = '';
  var fetchedUrl = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
  if (fetchedUrl) {
    var html = fetchedUrl.getContentText();
    if (html.length && regexInput.length) {
      output = html.match(new RegExp(regexInput, 'i'))[1];
    }
  }
  // Grace period to not overload
  Utilities.sleep(1000);
  return unescapeHTML(output);
}
 
var htmlEntities = {
  nbsp:  ' ',
  cent:  '¢',
  pound: '£',
  yen:   '¥',
  euro:  '€',
  copy:  '©',
  reg:   '®',
  lt:    '<',
  gt:    '>',
  mdash: '–',
  ndash: '-',
  quot:  '"',
  amp:   '&',
  apos:  '\''
};
 
function unescapeHTML(str) {
    return str.replace(/\&([^;]+);/g, function (entity, entityCode) {
        var match;
 
        if (entityCode in htmlEntities) {
            return htmlEntities[entityCode];
        } else if (match = entityCode.match(/^#x([\da-fA-F]+)$/)) {
            return String.fromCharCode(parseInt(match[1], 16));
        } else if (match = entityCode.match(/^#(\d+)$/)) {
            return String.fromCharCode(~~match[1]);
        } else {
            return entity;
        }
    });
};
Mr Shane
  • 520
  • 5
  • 18
Josh Bradley
  • 1,854
  • 1
  • 15
  • 31
  • 1
    Let's say want to get 2.49% APR on this page: https://actorsfcu.com/loans/auto with this function, how would I get it? I tried the regex but couldn't achieve it. – sojim Sep 08 '18 at 22:50
  • 2
    `>(.*?%\s+apr)` – Josh Bradley Sep 09 '18 at 06:07
  • How would we pass through quotes in the parameter of the regex with your function? For example `"(.*?%\s+apr)` – sojim Sep 09 '18 at 22:48
  • I am unsure what you mean by "pass through quotes." Can you open a new question? – Josh Bradley Sep 10 '18 at 17:12
  • Kudos for this great function... Following the example before if `A1="apr"` in theory it'd be possible to something like `>(.*?%\s+"&A1&")` but this function seems not correctly formatted and returns an error... Do you know if it's possible to pass a cell value within the regex portion? – MrSlash Sep 09 '19 at 07:59
  • @MrSlash Unfortunately I can't get that to work either. Let me know if you find a fix. – Josh Bradley Sep 10 '19 at 02:41
  • Code needs to be entered at Extensions -> Apps Script. I've tried to edit the answer, but I wasn't able to do that. – KWriter Sep 30 '22 at 07:55
0

There is no such script to exceed the limits. Since the code is run on a Google machine (server) you can not cheat. Some limits are bind to your spreadsheet, so you could try to use multiple spreadsheets, if that helps.

michaelsinner
  • 376
  • 1
  • 2
  • 10
  • thanks, thats good idea, but the problem ist, that i dont know the exact limit to divide the data into different spreadsheets. The importxml function also needs a very long time until it can extract the ?span class? i am looking for. – rookie4 Aug 19 '16 at 08:41