
I tried to run the code below.

However, I got the following message.

Did I miss some parameters?

What is the correct approach to using requests to get the search results?

Thank you very much.

This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service (https://www.google.com/policies/terms/). The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.

This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help; a different computer using the same IP address may be responsible. Learn more: https://support.google.com/websearch/answer/86640

Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.

import requests
from bs4 import BeautifulSoup

headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}


def google(q):
    s = requests.Session()
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)
    return r.text

result = google('"apple"')
John
  • Google serves that page to you precisely because they don't want you doing exactly what you're doing. (And, depending on where you live, intentionally circumventing their Terms of Service might carry legal consequences.) You might be able to change something about the requests you send to fool Google temporarily, but don't count on fooling them forever. You don't really want to get into an arms race with Google. – Daniel Pryden Feb 09 '19 at 02:32
  • Google bans bots/crawlers for utilizing their search to prevent people from building alternative search engines using their resources. Their detection capabilities are very advanced, involving not only checking request headers, but also using sophisticated Javascript techniques that detect mouse/keyboard interactions, network traffic, and a variety of other things. You're unlikely to defeat Google's bot detection systems. ... But in general, to defeat bot detection on most sites, you can simply pass request headers (requests lets you do this) resembling those of a common browser like Firefox. – J. Taylor Feb 09 '19 at 02:33
  • Possible duplicate of [google search with python requests library](https://stackoverflow.com/questions/22623798/google-search-with-python-requests-library) – Daniel Pryden Feb 09 '19 at 02:38
  • @J.Taylor: He's already sending a `User-Agent` header claiming to be Firefox. – Daniel Pryden Feb 09 '19 at 02:41
  • This answer describes how you can create a custom search and an API key to search the entire web (a minimal sketch follows these comments): https://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search – Lance Feb 09 '19 at 02:45
  • Possible duplicate of [Programmatically searching google in Python using custom search](https://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search) – Lance Feb 09 '19 at 02:46
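
Following up on the Custom Search suggestion in the comments above, here is a minimal sketch of Google's official Custom Search JSON API, which returns structured results with no CAPTCHA to bypass. The API key and search engine ID below are placeholders you would create yourself (see https://developers.google.com/custom-search/v1/overview):

import requests

API_KEY = "YOUR_API_KEY"       # placeholder: create one in the Google Cloud console
CX = "YOUR_SEARCH_ENGINE_ID"   # placeholder: ID of your Programmable Search Engine

def google_custom_search(query):
    # Official Custom Search JSON API endpoint; returns JSON, not HTML,
    # so there is no bot detection to work around.
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query},
    )
    response.raise_for_status()
    return response.json()

for item in google_custom_search('"apple"').get("items", []):
    print(item["title"], item["link"])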

2 Answers


I was using this for Google and it worked:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
}

def google(q):
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    request = Request(url, headers=headers)    # fixed typo: was "reqest"
    page = urlopen(request)
    soup = BeautifulSoup(page, 'html.parser')  # explicit parser avoids a warning
    return str(soup)                           # was "return r.text", but r was never defined
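
For example, a quick (hypothetical) call to the function above:

print(google('"apple"')[:300])  # first 300 characters of the returned HTML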

It might be because your user-agent is somewhat "wrong". Check what your user-agent actually is; changing it to one from a real, up-to-date browser could help you get the full HTML output.

Also, you do not really need to create a Session() unless you want to persist certain parameters across requests or make several requests to the same host with the same parameters (see the sketch after the code below).

import requests

headers = {
    # A current desktop Chrome user-agent; Google is more likely to serve the
    # full HTML to browser-like requests than to the default python-requests agent.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
}

def google(q):
    # Passing the query via params= lets requests URL-encode it correctly.
    response = requests.get('https://www.google.com/search', params={'q': q}, headers=headers)
    return response.text

result = google('"apple"')
print(result)
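
And if you do want to make several such requests, here is a minimal sketch of the Session() variant, where the headers are set once and reused:

import requests

session = requests.Session()
# Set browser-like headers once; the Session sends them with every request below.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
})

for query in ('"apple"', '"banana"'):
    html = session.get('https://www.google.com/search', params={'q': query}).text
    print(len(html))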

Alternatively, you can get results quickly, without worrying about such things, by using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

The difference is that you only have to think about the data you want to get, rather than figuring out how to bypass blocks (and all sorts of other things) and maintaining the scraper over time.

Code to integrate (for example, scraping each title and link from the first page of organic results):

import os
from serpapi import GoogleSearch

def serpapi_get_google_result():
    params = {
        "engine": "google",               # search engine to search with
        "q": "tesla",                     # query
        "hl": "en",                       # language
        "gl": "us",                       # country to search from
        "api_key": os.getenv("API_KEY"),  # https://serpapi.com/dashboard
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results["organic_results"]:
        print(result['title'])
        print(result['link'])


serpapi_get_google_result()

Disclaimer: I work for SerpApi.

Dmitriy Zub