0

Trying to crawl through google search results. This code works pretty well with all the other sites, I have tried, however not working with google. It returns an empty list.

from BeautifulSoup import BeautifulSoup
import requests

def googlecrawler(search_term):
    url="https://www.google.co.in/?gfe_rd=cr&ei=UVSeVZazLozC8gfU3oD4DQ&gws_rd=ssl#q="+search_term
    junk_code=requests.get(url)
    ok_code=junk_code.text
    good_code=BeautifulSoup(ok_code)
    best_code=good_code.findAll('h3',{'class':'r'})
    print best_code


googlecrawler("healthkart") 

It should return something like this.

<h3 class="r"><a href="/url?  sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=6&amp;cad=rja&amp;uact=8&amp;ved=0CEIQFjAF&amp;url=http%3A%2F%2Fwww.coupondunia.in%2Fhealthkart&amp;ei=qFmfVc2fFNO0uASti4PwDQ&amp;usg=AFQjCNFHMzqn-rH4Hp-fZK0E4wwxJmevEg&amp;sig2=QgwxMBdbPndyQTSH10dV2Q" onmousedown="return rwt(this,'','','','6','AFQjCNFHMzqn-rH4Hp-fZK0E4wwxJmevEg','QgwxMBdbPndyQTSH10dV2Q','0CEIQFjAF','','',event)" data-href="http://www.coupondunia.in/healthkart">HealthKart Coupons: July 2015 Coupon Codes</a></h3>
Tushar Bakaya
  • 45
  • 2
  • 10
  • 4
    Crawling Google is against their terms of service, and they reserve the right to put technical impediments in place to enforce those terms. Thus, any answer we gave you would potentially break in fairly short order as the impediments enforcing it were improved. If you want to search Google programmatically, sign up for a key to do so via the supported API. – Charles Duffy Jul 10 '15 at 05:47
  • oh okay..was just trying to crawl them for fun..Understood..thank you :) – Tushar Bakaya Jul 10 '15 at 05:51
  • @TusharBakaya If you use Chrome or Firefox to look at the page source, you should see that it's actually returned as Javascript and then assembled into HTML client-side. My guess is you were inspecting elements instead, which will show you the post-JS result. `BeautifulSoup` is only grabbing the original JS-based source. – Matthew Trevor Jul 10 '15 at 05:54
  • that makes complete sense. any idea how to go about this? – Tushar Bakaya Jul 10 '15 at 08:31
  • possible duplicate of [How can you search Google Programmatically Java API](http://stackoverflow.com/questions/3727662/how-can-you-search-google-programmatically-java-api) – Charles Duffy Jul 10 '15 at 16:21

2 Answers2

0

Looking at good_code i can't see a h3 or class "r" at all. That would be why your code is returning an empty list.

There is no problem with your code as such, but rather, that what you are searching for is not there.

What were you expecting to return?

user3636636
  • 2,409
  • 2
  • 16
  • 31
0

One of the common solutions is to add user-agent and pass intro request headers to fake real user visit:

# https://www.whatismybrowser.com/guides/the-latest-user-agent/
headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

So your code will look like this:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

def googlecrawler(search_term): 
    html = requests.get(f'https://www.google.com/search?q=', headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    for container in soup.findAll('div', class_='tF2Cxc'):
        title = container.select_one('.DKV0Md').text
        link = container.find('a')['href']
        print(f'{title}\n{link}')

googlecrawler('site:Facebook.com Dentist gmail.com')

# part of the output:
'''
COVID-19 Office Update Dear... - Canton Dental Associates ...
https://www.facebook.com/permalink.php?id=107567882605441&story_fbid=3205134459515419

Spinelli Dental - General Dentist - Rochester, New York ...
https://www.facebook.com/spinellidental/about/?referrer=services_landing_page

LaboSmile USA Dentist & Dental Office in Delray ... - Facebook
https://www.facebook.com/labosmileusa/
'''

Alternatively, you can do it by using Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

Code to integrate:

from serpapi import GoogleSearch
import os, json, re

params = {
  "engine": "google",
  "q": "site:Facebook.com Dentist gmail.com",
  "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  link = result['link']
  print(f'{title}\n{link}\n')

# part of the output:
'''
Green Valley Dental - About | Facebook
https://www.facebook.com/GVDFamily/about/

My Rivertown Dentist - About | Facebook
https://www.facebook.com/Rivertownfamily/about/

COVID-19 Office Update Dear... - Canton Dental Associates ...
https://www.facebook.com/permalink.php?id=107567882605441&story_fbid=3205134459515419
'''

Disclaimer, I work for SerpApi.

Dmitriy Zub
  • 1,398
  • 8
  • 35