I run a website that provides various pieces of data in chart/tabular format for people to read. Recently I've noticed an increase in requests to the website that originate from Google Docs. Judging by the IPs and User Agent, the traffic does appear to come from Google servers - example IP lookup here.
The number of hits is in the region of 2,500 to 10,000 requests per day.
I assume that someone has created one or more Google Sheets that scrape data from my website (possibly using the IMPORTHTML function or similar). I would prefer that this did not happen (as I cannot know if the data is being attributed properly).
Is there a preferred way to block this traffic that Google supports/approves?
I would rather not block based on IP addresses, as blocking Google servers feels wrong and may lead to future problems, and the IPs could change. At the moment I am blocking (returning a 403 status) any request whose User Agent contains GoogleDocs or docs.google.com.
Traffic is mostly coming from 66.249.89.221 and 66.249.89.223 at present, always with the user agent Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; http://docs.google.com)
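To make the current User Agent check concrete, here is a minimal sketch of it as Python WSGI middleware. The WSGI stack is an assumption made purely for illustration, and the names BLOCKED_UA_TOKENS, is_google_docs_request and BlockGoogleDocs are hypothetical; the same substring test could be written in any server configuration.

```python
# Substrings that mark a request as coming from Google Docs / Sheets.
# Both tokens appear in the User Agent quoted above; matching is case-insensitive.
BLOCKED_UA_TOKENS = ("googledocs", "docs.google.com")


def is_google_docs_request(user_agent):
    """Return True if the User-Agent header contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


class BlockGoogleDocs:
    """Hypothetical WSGI middleware: answer 403 to matching requests."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_google_docs_request(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application as BlockGoogleDocs(app) would apply the check to every request. Note that this relies entirely on the User Agent header, so it only stops clients that identify themselves honestly.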
As a secondary question: is there a way to trace the document or its account owner? I have access to the URLs being requested, but little else to go on, as the requests appear to be proxied through the Google Docs servers (no Referer, cookies or other such data in the HTTP logs).
Thank you.