
How do I avoid being blocked by Google when querying their search engine via requests? I iterate through a list of dates so that I can get the results for a query like "Microsoft Release" for each month in the list.

I am currently changing user agents and adding a 10-second time.sleep between requests, but I always get blocked. How do I use proxies in conjunction with my approach? Is there a better way to do this?

from bs4 import BeautifulSoup
import requests
import random  # needed for random.choice below

http_proxy  = "http://10.10.1.10:3128"
https_proxy = "https://10.10.1.11:1080"
ftp_proxy   = "ftp://10.10.1.10:3128"

proxyDict = {
    "http":  http_proxy,
    "https": https_proxy,
    "ftp":   ftp_proxy,
}

# startDate, endDate and user_agents are defined earlier in my script
page_response = requests.get(
    'https://www.google.com/search?q=Microsoft+Release&rlz=1C1GCEA_enGB779'
    '&tbs=cdr:1,cd_min:' + startDate + ',cd_max:' + endDate +
    '&source=inms&tbm=nws&num=150',
    timeout=60,
    verify=False,
    headers={'User-Agent': random.choice(user_agents)},
    proxies=proxyDict,
)
soup = BeautifulSoup(page_response.content, 'html.parser')

I then get the following error:

ConnectTimeout: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=Microsoft+Release&rlz=1C1GCEA_enGB779&tbs=cdr:1,cd_min:'+startDate+',cd_max:'+endDate+'&source=inms&tbm=nws&num=150 (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1811499358>, 'Connection to 10.10.1.11 timed out. (connect timeout=60)'))

Any idea how to resolve this error and make it work?

Bilal Siddiqui
  • Obviously 10.10.1.11 is not responding to `CONNECT` requests. If those are real IP addresses, note that using an internal proxy won't help you avoid getting blacklisted. When they say "use proxies" they mean a multitude of open proxies, not ones you run on your own network. – Selcuk Jan 08 '19 at 23:29
  • I'm planning on buying open proxies, but how do I implement them in Python once I get them? Also, is it 100% certain that I won't get blacklisted if I use them? – mike.depetriconi Jan 08 '19 at 23:40
  • There is always a possibility that you will get blocked by Google. BTW, the data that you are querying is available from Microsoft. – Life is complex Jan 09 '19 at 05:49
  • Any ideas on how to avoid being blocked by Google in my case? – mike.depetriconi Jan 09 '19 at 20:32

1 Answer


One way is to pass a proxies dict to requests.get, for example:

# https://stackoverflow.com/a/13395324/15164646
proxies = {
    'http': 'http://10.10.1.10:3128'  # your proxy URL goes here
}

Which becomes the following (there is also an online IDE example showing how to scrape Google Scholar with a proxy):

requests.get(
    'https://www.google.com/search?q=Microsoft+Release&rlz=1C1GCEA_enGB779'
    '&tbs=cdr:1,cd_min:' + startDate + ',cd_max:' + endDate +
    '&source=inms&tbm=nws&num=150',
    proxies=proxies,
    headers=headers,
).text
...
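
Once you have a pool of open proxies, a common pattern is to rotate through them and retry when one times out or gets blocked. Here is a minimal sketch of that idea; the proxy URLs and user-agent strings below are placeholders, not working values:

import random
import requests

# Hypothetical pool of purchased/open proxies -- replace with real ones
proxy_pool = [
    'http://111.111.111.111:8080',
    'http://222.222.222.222:3128',
]
# Hypothetical user-agent strings -- use a realistic, varied list
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

def get_with_rotation(url, max_attempts=5):
    """Try the request through randomly chosen proxies until one succeeds."""
    for _ in range(max_attempts):
        proxy = random.choice(proxy_pool)
        try:
            response = requests.get(
                url,
                proxies={'http': proxy, 'https': proxy},
                headers={'User-Agent': random.choice(user_agents)},
                timeout=10,
            )
            response.raise_for_status()
            return response
        except requests.RequestException:
            continue  # dead/blocked proxy -- try the next one
    raise RuntimeError('all proxy attempts failed')

Any proxy that times out or returns an error status is simply skipped, so a single dead proxy (like the 10.10.1.11 timeout in the question) no longer kills the whole run.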

An alternative solution is to use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

The main difference in this particular case is that you don't have to maintain the parser or keep finding ways to avoid being blocked by Google; that is already handled for the end user. Check out the playground.
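
For instance, here is a minimal sketch using SerpApi's google-search-results Python package; the api_key and the exact date-range values are placeholders, so check the docs/playground for the precise parameters:

# pip install google-search-results
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "Microsoft Release",
    "tbm": "nws",                                     # Google News results
    "tbs": "cdr:1,cd_min:1/1/2019,cd_max:1/31/2019",  # same date-range syntax as the question's URL
    "api_key": "YOUR_API_KEY",                        # placeholder
}

results = GoogleSearch(params).get_dict()
for news_result in results.get("news_results", []):
    print(news_result["title"], news_result["link"])

The tbm and tbs parameters mirror the Google URL parameters used in the question, so the monthly date-range loop carries over directly.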

Disclaimer: I work for SerpApi.

Dmitriy Zub