This is one of the most common questions on Stack Overflow, asked 200+ times in the [requests] and [bs4] tags, and pretty much every solution comes down to simply adding a `user-agent`.

A `user-agent` is needed to fake a "real" user visit: the script sends a browser-like user-agent string to announce itself as a regular client.

When no `user-agent` is passed in the request headers while using the `requests` library, it defaults to `python-requests`. Google recognizes that the request comes from a bot/script, blocks it (or otherwise degrades the response), and you receive different HTML (with some sort of error) and different CSS selectors. Check what your user-agent is. List of user-agents.
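For example, a quick way to see what `requests` sends by default (httpbin.org is used here purely as an echo service; it is not part of the original answer):

```python
import requests

# httpbin.org echoes back the headers it received
response = requests.get('https://httpbin.org/headers')
print(response.json()['headers']['User-Agent'])
# -> something like "python-requests/2.31.0"
```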
Note: adding a `user-agent` doesn't guarantee the problem is fixed; you can still get a 429 (or a different) error, even when rotating user-agents.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines. In short, you need to:

- rotate user-agents (a minimal sketch is shown after this list);
- add proxies (and rotate them);
- use a CAPTCHA solver to solve Google's (or another website's) CAPTCHA;
- use browserless / browser automation (optional).
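A minimal sketch of the first two points, assuming a hand-maintained list of user-agent strings picked with `random.choice()` (the strings and the commented-out proxy address are illustrative placeholders, not recommendations):

```python
import random
import requests

# Illustrative user-agent strings; keep a larger, up-to-date list in practice
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0',
]

headers = {'User-Agent': random.choice(user_agents)}  # a different UA per request

# Proxies are passed the same way (hypothetical proxy address):
# proxies = {'http': 'http://user:pass@host:8080', 'https': 'http://user:pass@host:8080'}

response = requests.get('https://www.google.com/search',
                        params={'q': 'python web scraping', 'hl': 'en'},
                        headers=headers,
                        timeout=30)
print(response.status_code)
```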
Pass a `user-agent` to `requests.get()`:
```python
import requests

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)
```
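With the header in place you can parse the response with `beautifulsoup4`. A minimal sketch, assuming the `.tF2Cxc` result container and `.yuRUbf a` link selector that Google's organic results used at the time of writing (these class names change often and may already be outdated):

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

html = requests.get('https://www.google.com/search',
                    params={'q': 'web scraping with python', 'hl': 'en'},
                    headers=headers,
                    timeout=30)
soup = BeautifulSoup(html.text, 'html.parser')

# .tF2Cxc was the organic-result container at the time of writing; adjust when Google changes it
for result in soup.select('.tF2Cxc'):
    title = result.select_one('h3').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
```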
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to spend time trying to bypass blocks from Google or figuring out why certain things don't work.
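A rough sketch of what that looks like with the `google-search-results` Python package (check the SerpApi docs for the full list of parameters and use your own `api_key`):

```python
from serpapi import GoogleSearch  # pip install google-search-results

params = {
    'api_key': 'YOUR_API_KEY',        # your SerpApi key
    'engine': 'google',               # search engine to use
    'q': 'web scraping with python',  # search query
    'hl': 'en',                       # interface language
}

search = GoogleSearch(params)
results = search.get_dict()           # JSON response parsed into a dict

for result in results.get('organic_results', []):
    print(result['title'], result['link'], sep='\n')
```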
Disclaimer: I work for SerpApi.