Block Website Scraping by Google Docs

Question

I run a website that provides various pieces of data in chart/tabular format for people to read. Recently I've noticed an increase in the requests to the website that originate from Google Docs. Looking at the IPs and User Agent, it does appear to be originating from Google servers - example IP lookup here.

The number of hits is in the region of 2,500 to 10,000 requests per day.

I assume that someone has created one or more Google Sheets that scrape data from my website (possibly using the IMPORTHTML function or similar). I would prefer that this did not happen (as I cannot know if the data is being attributed properly).

Is there a preferred way to block this traffic that Google supports/approves?

I would rather not block based on IP addresses, as blocking Google servers feels wrong and may lead to future problems or IPs could change. At the moment I am blocking (returning 403 status) based on User Agent containing GoogleDocs or docs.google.com.

Traffic is mostly coming from 66.249.89.221 and 66.249.89.223 at present, always with the user agent Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)

As a secondary question: Is there a way to trace the document or its account owner? I have access to the URLs that they are accessing, but little else to go on as the requests appear to proxy through the Google Docs servers (no Referer, Cookies or other such data in the HTTP logs).

Thank you.

score 7 · Accepted Answer · answered Apr 10 '17 at 04:19

Blocking on User-Agent is a great solution, because there doesn't appear to be a way to set a different User-Agent and still use the IMPORTHTML function. Since you're happy to ban all usage from Google Sheets, that's perfect.
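As a rough illustration of the User-Agent block, here is a minimal sketch written as WSGI middleware. The substrings come from the user agent quoted in the question; the names `BLOCKED_UA_MARKERS`, `is_google_docs_fetcher` and `block_google_docs` are made up for this example, and your own stack (Apache, nginx, a framework) will likely have a more direct way to do the same thing.

```python
# Hypothetical sketch: reject requests whose User-Agent looks like the
# Google Docs/Sheets fetcher described in the question.

# Substrings taken from the user agent quoted in the question (lowercased).
BLOCKED_UA_MARKERS = ("googledocs", "docs.google.com")


def is_google_docs_fetcher(user_agent):
    """Return True if the User-Agent contains one of the blocked markers."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BLOCKED_UA_MARKERS)


def block_google_docs(app):
    """Wrap a WSGI app so that matching requests get a 403 response."""
    def middleware(environ, start_response):
        if is_google_docs_fetcher(environ.get("HTTP_USER_AGENT", "")):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Automated imports from Google Docs are not permitted.\n"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

The match is case-insensitive and substring-based, so it should keep working if the rest of the user agent string changes, as long as one of the two markers stays.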

Some additional thoughts, in case a full-on ban seems unpleasant:

Rate limit it: since, as you say, the traffic comes mostly from two IPs and always with the same user agent, just slow down your response. As long as the requests are serial, you can still provide the data, but at a pace that may be enough to discourage scraping. Delay your response (to suspected scrapers) by 20 or 30 seconds.
Redirect to a "You're blocked" screen, or a screen with "default" data (i.e., scrapable, but not current). This is better than a bare 403 because it tells the human that the site is not for scraping, and you can then direct them to purchasing access (or at least requesting a key from you).
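Both ideas above could be sketched roughly like this. It is only an outline under a few assumptions: the function and constant names are hypothetical, the notice text is a placeholder, and the placeholder page is served directly rather than through a redirect to keep the example short. The `sleep` parameter exists only so the delay can be swapped out when testing.

```python
import time

# Substrings from the user agent quoted in the question (lowercased).
SUSPECT_MARKERS = ("googledocs", "docs.google.com")

# The suggestion above is 20 or 30 seconds; 20 is used here.
DELAY_SECONDS = 20.0

# Placeholder page. It is wrapped in a table on the assumption that the
# scraper is importing a table, so the notice ends up in their sheet.
NOTICE_PAGE = (
    "<html><body><table><tr><td>"
    "This data is not available for automated import. "
    "Please contact the site owner to request access."
    "</td></tr></table></body></html>"
)


def looks_like_sheets_import(user_agent):
    """Return True if the User-Agent matches the suspected scraper."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in SUSPECT_MARKERS)


def respond(user_agent, live_page, mode="delay", sleep=time.sleep):
    """Return (status, body) for a request.

    mode="delay"  : serve the live page to suspects after DELAY_SECONDS.
    any other mode: serve NOTICE_PAGE to suspects instead of live data.
    Normal visitors always get the live page immediately.
    """
    if not looks_like_sheets_import(user_agent):
        return 200, live_page
    if mode == "delay":
        sleep(DELAY_SECONDS)
        return 200, live_page
    return 200, NOTICE_PAGE
```

One caveat with the delay variant: on a synchronous server, a blocking sleep ties up a worker for the whole delay, so at several thousand requests per day it may be worth checking that this does not hurt normal visitors.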

score 0 · Answer 2 · answered Apr 10 '17 at 09:26

You can force the issue by setting a cookie on the first request and serving the real response only if the cookie is present. This way any "simple" imports will not work: the first request carries no cookie, so there is nothing for a third party to read.
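A minimal sketch of that cookie gate, assuming the importing client does not send cookies back (the question notes that no cookies show up in the logs). The cookie name `seen`, the function `cookie_gate` and the stub page are all hypothetical. The stub contains a plain link rather than an automatic redirect, so a client that discards cookies cannot get stuck in a redirect loop.

```python
from html import escape
from http.cookies import CookieError, SimpleCookie

COOKIE_NAME = "seen"  # hypothetical cookie name for this example


def cookie_gate(cookie_header, path, page):
    """Return (status, headers, body) for a request.

    The real page is served only when the gate cookie is present.
    Otherwise the cookie is set and a stub page without data is returned.
    """
    cookies = SimpleCookie()
    try:
        cookies.load(cookie_header or "")
    except CookieError:
        # Treat an unparsable Cookie header as "no cookie".
        cookies = SimpleCookie()

    content_type = ("Content-Type", "text/html; charset=utf-8")

    if COOKIE_NAME in cookies:
        return 200, [content_type], page

    # First visit: set the cookie and show a stub with no data in it.
    stub = (
        "<html><body><p>Please "
        f'<a href="{escape(path, quote=True)}">continue to the page</a>.'
        "</p></body></html>"
    )
    headers = [
        content_type,
        ("Set-Cookie", f"{COOKIE_NAME}=1; Path=/; HttpOnly"),
    ]
    return 200, headers, stub
```

Note that this also affects every legitimate first-time visitor and any crawler that ignores cookies, so it is a heavier tool than the User-Agent check.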
