0

I would like to scrape data from a page, but cannot figure out the right xpath for Google sheets. I would like to extract the number 202 from https://www.belvilla.nl/zoeken/?land=nl&rgo=frie (on top of the page, "202 vakantiehuizen gevonden in Friesland")

If I take the xpath, I get: //*[@id="result-container-items"]/div[1]/div/div/div[1]/div[1]/div[1]/strong

In Google sheets I have tried =IMPORTXML(A1;"//*[@id="result-container-items"]/div[1]/div/div/div[1]/div[1]/div[1]/strong)") and some others like =IMPORTXML(A1;"//div[@class='search-numbers']"), but none of them are working. For the last one I get an error with 'Resource with URL content has exceeded the size limit.' but I'm guessing my xpath is wrong.

Can anyone help me out? Thanks!

Ilsed
  • 3
  • 2
  • 1
    See https://stackoverflow.com/questions/64172780/importxml-resource-at-url-contents-exceeded-maximum-size regarding the error you received from Google Sheets – David Denenberg Jul 02 '21 at 17:55

2 Answers2

0

google sheets do not support the scraping of JavaScript elements. you can check this if you disable JS for a given URL and you will be left with content you could import. in your case, this cant be achieved with IMPORTXML:

enter image description here

player0
  • 124,011
  • 12
  • 67
  • 124
0

IMPORTXML has its limitations especially on JS elements. However, if scripting is an option, try using UrlFetchApp.fetch() in Google Apps Script.

Code:

function fetchUrl(url) {
  var html = UrlFetchApp.fetch(url).getContentText();
  // startString and endString must be unique or at least the first result 
  // enclosing the result we want
  var startString = 'search-result-div" ';
  var endString = 'alternate-dates-filter-bar';
  var startIndex = html.search(startString);
  var endIndex = html.search(endString);
  // regex for numbers and text content
  var numbers = /strong>([^<]+)<\/strong/;
  var text = /span>([^<]+)<\/span/;
  // clean content then combine matches of numbers and text
  var content =  html.substring(startIndex, endIndex).replace(/\s\s+/g, ' ');
  var result = numbers.exec(content)[1] + ' ' + text.exec(content)[1];
  return result.trim();
}

Output:

output

Note:

  • Code above is specific to what you are fetching. You will need to update the script processing of the response if you want anything else.
  • You can reuse this on other url and will fetch the similar value located on your wanted xpath in your post.
  • This doesn't make use of the xpath.
NightEye
  • 10,634
  • 2
  • 5
  • 24