
I am trying to create a Google Sheets document that takes the product number of a Sigma-Aldrich product and copies certain information from the product page. When I try to use the built-in IMPORTXML function from Google Sheets, I receive an error message that says "Could not fetch the URL". An example of a URL and XPath would be =IMPORTXML("https://www.sigmaaldrich.com/catalog/product/aldrich/364525","/h1"). I have also tried a web scraper using Cheerio, as shown at https://eikhart.com/blog/google-sheets-scraper, which did not work for sigmaaldrich.com.

The ImportFromWeb add-on worked, but it has a monthly limit. Could you give any suggestions on how I could solve this issue?

Vincent

1 Answer


I think it has to do with Google's policy of respecting a site's indications about web crawling.

Google Sheets checks whether the target site allows crawling of its pages, so it consults the site's robots.txt file (see the documentation about robots.txt) to see what it can and cannot fetch.
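As a quick way to see how robots.txt rules work, Python's standard library can parse them. The rules below are made up for illustration; the site's real robots.txt at https://www.sigmaaldrich.com/robots.txt will differ, so this is just a sketch of the checking mechanism:

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt fragment for illustration only; the real file
# on sigmaaldrich.com contains different (and many more) rules.
robots_txt = """\
User-agent: *
Disallow: /search
Disallow: /checkout
"""

rfp = RobotFileParser()
rfp.parse(robots_txt.splitlines())

# A path under a Disallow prefix is blocked for all user agents.
print(rfp.can_fetch("*", "https://www.sigmaaldrich.com/search?q=364525"))
# → False

# A path not matched by any Disallow rule is allowed.
print(rfp.can_fetch("*", "https://www.sigmaaldrich.com/catalog/product/aldrich/364525"))
# → True
```

Note that even when robots.txt allows a path, the server can still reject requests at the firewall or application level, which is consistent with the "Could not fetch the URL" error here.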

If you check the site's robots.txt file yourself, you will see it disallows search engines from accessing many folders. Even though /catalog/product is not listed there, the site may still indicate on individual pages that it does not allow web scraping.

You can look for a scraper that does this job for you, or build your own; however, I don't think you will get very far with Google Sheets when trying to fetch information from this particular site.

Solution:

  • If you know a bit of Python, look at Beautiful Soup or Selenium to build a web crawler.
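A minimal Beautiful Soup sketch of the extraction step (assuming `pip install beautifulsoup4`). The HTML below is an invented stand-in for a product page; the real sigmaaldrich.com markup will differ, and in practice you would first fetch the page (e.g. with the requests library, or with Selenium if the page is rendered by JavaScript or blocks plain HTTP clients):

```python
from bs4 import BeautifulSoup

# Invented sample HTML; the element names and classes are assumptions,
# not the actual sigmaaldrich.com page structure.
html = """
<html><body>
  <h1 class="product-name">Example Product Name</h1>
  <span class="product-number">364525</span>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pull out the fields you would otherwise target with XPath.
name = soup.select_one("h1.product-name").get_text(strip=True)
number = soup.select_one("span.product-number").get_text(strip=True)

print(name)    # Example Product Name
print(number)  # 364525
```

Once you have the values in Python, you can write them back to the spreadsheet with the Google Sheets API instead of relying on IMPORTXML.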
nabais
  • What do you mean by "it has to do with the politics of google to follow the site's indication of web crawling"? – Kessy Oct 12 '20 at 14:43
  • What I meant is that either the URL the user is trying to fetch is disallowed in robots.txt and so would not be fetched, or the website owner has set up some type of firewall to block Google Sheets traffic from fetching their URLs. Another thread about Google Sheets and robots.txt is here: https://stackoverflow.com/questions/47908176/capture-element-using-importxml-with-xpath – nabais Oct 12 '20 at 14:53
  • For more information about robots.txt itself, see: https://en.wikipedia.org/wiki/Robots_exclusion_standard – nabais Oct 12 '20 at 14:54
  • 1
    More information from Google [Robots.txt Specifications](https://developers.google.com/search/reference/robots_txt) – Kessy Oct 19 '20 at 09:39