
I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

ML_

3 Answers


Google disallows automated access in its Terms of Service, so if you accept those terms, automated queries would violate them.

That said, I know of no lawsuit by Google against a scraper. Even Microsoft scraped Google; they powered their search engine Bing with it. They got caught red-handed in 2011 :)

There are two options to scrape Google results:

1) Use their API

UPDATE 2020: Google has deprecated the previous APIs (again) and introduced new prices and new limits. Now (https://developers.google.com/custom-search/v1/overview) you can query up to 10k results per day at 1,500 USD per month; more than that is not permitted, and the results are not what Google displays in normal searches. A minimal curl sketch of an API query follows the list below.

  • You can issue around 40 requests per hour. You are limited to what they give you, so it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.

  • If you want a higher amount of API requests you need to pay.

  • 60 requests per hour cost 2,000 USD per year; more queries require a custom deal.
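Since the question mentions curl, here is a minimal sketch of what such an API query looks like. `YOUR_API_KEY` and `YOUR_CX` (the search engine ID) are placeholders you would create in the Google developer console, not values from this answer:

```sh
# Minimal sketch: query the Custom Search JSON API with curl.
# YOUR_API_KEY and YOUR_CX are placeholders, not real credentials.
curl -sG "https://www.googleapis.com/customsearch/v1" \
  --data-urlencode "key=YOUR_API_KEY" \
  --data-urlencode "cx=YOUR_CX" \
  --data-urlencode "q=test query"
```

The response is JSON, so it is easy to post-process; just remember that these results differ from what a normal Google search page shows.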

2) Scrape the normal result pages

  • Here comes the tricky part: it is possible to scrape the normal result pages, but Google does not allow it.
  • If you scrape at a rate higher than 8 (updated from 15) keyword requests per hour you risk detection; higher than 10/h (updated from 20) will get you blocked, in my experience.
  • By using multiple IPs you can raise the rate, so with 100 IP addresses you can scrape up to 1,000 requests per hour (24k a day, updated). A curl sketch of this kind of slow scraping follows this list.
  • There is an open source search engine scraper written in PHP at http://scraping.compunect.com. It lets you scrape Google reliably, parses the results properly, and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart; otherwise the code is still useful for learning how it is done.
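As a rough illustration of the rates above, here is a deliberately slow curl loop. The keyword file, User-Agent string, and exact delay are assumptions made for this sketch, not tested values:

```sh
#!/bin/sh
# Illustrative sketch only, not a robust scraper: fetch one result page
# per keyword at a very low rate. keywords.txt (one keyword per line)
# and the User-Agent string are assumptions for the example.
n=0
while IFS= read -r keyword; do
  n=$((n + 1))
  curl -sL -A "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0" \
    -G --data-urlencode "q=$keyword" \
    "https://www.google.com/search" -o "serp_$n.html"
  sleep 480   # one request every 8 minutes stays under the ~8/hour threshold above
done < keywords.txt
```

Scaling this out to many IPs (e.g. via curl's `--proxy` option) is where the real complexity lives; the sketch only shows the pacing idea.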

3) Alternatively use a scraping service (updated)

  • Recently a customer of mine had a huge search engine scraping requirement, but it was not 'ongoing'; it was more like one huge refresh per month.
    In this case I could not find a self-made solution that's 'economic'.
    I used the service at http://scraping.services instead. They also provide open source code, and so far it's running well (several thousand result pages per hour during the refreshes).
  • The downside is that such a service means that your solution is "bound" to one professional supplier; the upside is that it was a lot cheaper than the other options I evaluated (and faster in our case).
  • One option to reduce the dependency on one company is to use two approaches at the same time: the scraping service as the primary source of data, with a fallback to a proxy-based solution as described at 2) when required. A tiny sketch of that fallback idea follows this list.
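A shell sketch of that dual-approach idea; `fetch_via_service` and `fetch_via_proxies` are hypothetical stand-in names for the two implementations, not real tools:

```sh
# Hypothetical sketch: try the scraping service first and fall back to
# the in-house proxy-based scraper (option 2) only when it fails.
fetch_serp() {
  fetch_via_service "$1" || fetch_via_proxies "$1"
}
```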
John
  • The problem I have with this explanation is that even a handful of people sharing the same IP will greatly exceed 20 requests per hour. If this is the whole story then Google would be blocking basically every small business which uses computers heavily on a regular basis. The accepted answer would have the same issue. – krowe Mar 28 '14 at 21:35
  • Actually Google does captcha-block NAT IPs on a regular basis; I've worked at multiple companies and the case of captchas came up several times. I should also have clarified that I meant 20 requests with a different keyword; as long as you stick to the same keyword you can keep browsing the result pages. Also the block will not happen after one hour; you can actually burst Google, but if you keep hitting it at a higher rate you will be sent into Captcha-land. Google seems to be kind regarding bursts, but not if you keep going. Just try it out :) – John Mar 28 '14 at 21:39
  • I've been using it (the search engine scraper and the suggest one) in more than one project. It works almost perfectly. Once a year or so it stops working due to changes at Google and is usually updated within a few days. – John Feb 17 '15 at 00:29
  • I was trying the same, but instead of a captcha solver my bot hits the **403 error** page `http://ipv4.google.com/sorry/index?continue=http://www.google.com/search%3Fq%3Dnewabc%26start%3D0%26safe%3Dactive&q=CGMSBA6L2YoYzdHXwAUiGQDxp4NLtUqumgC0PtvCwbAP0mNmHfOShXQ`, whereas there is another page that shows the captcha. I don't understand why it is not hitting the captcha page. I would like to post the code if someone wants to help. – Sagar Kar Oct 30 '16 at 12:41
  • Sagar: this is because Google redirects to the captcha page. I think they try to make it harder for people to use automated captcha-solving services. Actually in the past week the behaviour changed, and I've recently seen cases without that redirect. Google constantly changes. – John Dec 29 '16 at 18:16
  • If there are those limits, how do the keyword ranking tools work? Why aren't they blocked? Where can I find info about the paid API that increases the number of requests? – Aerendir Mar 24 '17 at 23:33
  • @John Can you please add a link to the Google TOS where that is written? – Joozty Oct 15 '17 at 16:53
  • @Joozty: https://www.google.com/intl/en/policies/terms/ "Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide." "We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct." I'm not sure if there are different TOS involved in addition. According to the TOS they reserve the right to stop service to you if you break the rules. That's also the only consequence I know about. – John Oct 16 '17 at 22:55
  • What is the proper API to use? – simPod Feb 24 '18 at 22:15
  • @simPod Google deprecates those APIs regularly. Currently this should be the right one: https://developers.google.com/custom-search/v1/overview It's different than before; now it permits 100 free searches per day. You can do up to 10,000 searches per day at 1,500 USD / month, and there is no legal offer to exceed 10k/day. It should also be noted that the results won't be like the official Google searches, so it's not useful for every purpose. – John Oct 30 '20 at 18:34
  • You said "If you scrape at a rate higher than 8 (updated from 15) keyword requests"; does that mean only 8 different result pages an hour? Sorry, I couldn't get that. What if we keep scraping different pages of the same keyword? – Burak Kaymakci Feb 25 '21 at 19:25
  • @AndréYuhai It will depend on so many factors by now. Scraping Google was easier when I first made the answer; by now I'd have to make the 4th revision, I guess. If your keywords are great you can scrape a bit higher than that; if Google detects a similarity it's less. The same for staying inside the keyword (pages): that was simple before, and today it's the opposite, so try not to scrape much beyond 1-2 pages. Getting a captcha now and then was a high alert a few years ago; today it's not avoidable. In the end you'll need to find out by slowly experimenting. – John Feb 27 '21 at 19:55
  • @John *Google disallows automated access in their ToS* - where in their ToS does it say that? – user10186832 Feb 16 '23 at 05:06
  • They call it automated queries or similar and list that in every relevant terms document they have. However, you never legally accepted their terms (except sub-terms if you sign up for an account, and even that 'signature' stands on weak legal footing), and you can reject their terms at any time. Also there is no legal issue if you break the terms, though they would have the right to cancel an account of yours, if you have one. Google is the biggest web scraper in the world; they never asked anyone about it and they never accepted any terms of your website. You can ignore theirs as well. – John Feb 16 '23 at 19:29

Google will eventually block your IP when you exceed a certain number of requests. A quick way to see from curl whether that has happened is sketched below.
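The comments on the accepted answer mention that blocked requests get redirected to a /sorry/ captcha page (with a 403 in one report); assuming the behaviour still roughly matches, a rough check looks like this:

```sh
# Hedged sketch: inspect the final URL and status code of a test query.
# A redirect to /sorry/ (or a non-200 status such as the 403 mentioned
# in the comments above) suggests the IP is blocked.
result=$(curl -sL -o /dev/null -w "%{http_code} %{url_effective}" \
  "https://www.google.com/search?q=test")
case "$result" in
  *"/sorry/"*) echo "blocked (captcha page): $result" ;;
  200*)        echo "ok" ;;
  *)           echo "possibly blocked: $result" ;;
esac
```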

Severin
  • The last time I looked at it, I was using an API to search via Google. If I recall correctly, that limit was at 2,500 requests/day. – Severin Mar 26 '14 at 12:14
  • Legally it's not possible, but you can try this small tool on Envato: https://codecanyon.net/item/google-search-scraper/22081561?ref=intelliwins – sambit.albus Jun 11 '18 at 11:34
  • Use https://www.serphouse.com for a Google and Bing search API; it also offers a free trial with 400 requests, and custom plans on demand. – Mehul V. Nov 06 '19 at 06:10
  • You could always use a third party solution like [SerpApi](https://www.serpapi.com/) to do this for you. It's a paid API with a free trial. They handle proxies, solve captchas, and parse all the rich structured data for you. – Milos Djurdjevic Jun 16 '21 at 19:08

Google thrives on scraping the websites of the world, so if scraping were "so illegal", even Google would not survive. Of course, other answers already mention ways of mitigating IP blocks by Google. One more avenue for avoiding captchas could be scraping at random times (I haven't tried it). Moreover, I feel that if we provide novelty or some significant processing of the data, that sounds fine, at least to me; if we are simply copying a website, or hampering its business or brand in some way, then it is bad and should be avoided. On top of all that, if you are a startup, no one will fight you because there is no benefit; but if your entire premise rests on scraping even once you are funded, then you should think about more sophisticated approaches or alternative APIs. Also, Google keeps releasing (or deprecating) fields for its API, so what you want to scrape now may be on the roadmap of future Google API releases.

raghav