
I'm trying to get a description and an email address from each Google search result, but my script returns only titles and links. I'm using Selenium to open the pages and bs4 to scrape the actual content.

What am I doing wrong? Please help. Thanks!

import re

from bs4 import BeautifulSoup
from bs4.element import Tag

# driver is the Selenium webdriver that has already loaded the results page
soup = BeautifulSoup(driver.page_source, 'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
emails = []
phones = []

for r in result_div:
    # Check that each element is present; otherwise skip this result
    try:
        # link
        link = r.find('a', href=True)

        # title
        title = r.find('h3')
        if isinstance(title, Tag):
            title = title.get_text()

        # desc
        description = r.find('div', attrs={'class': 'IsZvec'})
        #description = r.find('span')
        if isinstance(description, Tag):
            description = description.get_text()
            print(description)

        # email
        email = r.find(text=re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*'))
    except AttributeError:
        continue

2 Answers


The main issue here is that the class names on Google's results page are generated dynamically, so you have to change your strategy and select your elements by tag structure or id instead.

...
data = []

for e in soup.select('div:has(> div > a h3)'):
    data.append({
        'title':e.h3.text,
        'url':e.a.get('href'),
        'desc':e.next_sibling.text,
        'email':m.group(0) if (m:= re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text)) else None
    })
    
data

Output

[{'title': 'Email design at Stack Overflow',
  'url': 'https://stackoverflow.design/email/guidelines/getting-started/',
  'desc': 'An email design system that helps us work together to create consistently-designed, properly-rendered email for all Stack Overflow users.',
  'email': None},
 {'title': 'Is email from do-not-reply@stackoverflow.email legit? - Meta ...',
  'url': 'https://meta.stackoverflow.com/questions/338332/is-email-from-do-not-replystackoverflow-email-legit',
  'desc': '23.11.2016 · 1\xa0AntwortYes it is legit. We use it to protect stackoverflow.com user cookies from third parties. The links in the email are all rewritten to a\xa0...',
  'email': 'do-not-reply@stackoverflow.email'},
 {'title': "Newest 'email' Questions - Stack Overflow",
  'url': 'https://stackoverflow.com/questions/tagged/email',
  'desc': 'Use this tag for questions involving code to send or receive email messages. Posting to ask why the emails you send are marked as spam is off-topic for Stack\xa0...',
  'email': None},
 {'title': 'Contact information - contact us today - Stack Overflow',
  'url': 'https://stackoverflow.co/company/contact',
  'desc': "A private, secure home for your team's questions and answers. Perfect for teams of 10-500 members. No more digging through stale wikis and lost emails—give your\xa0...",
  'email': None},
 {'title': 'How can I get the email of a stackoverflow user? - Meta Stack ...',
  'url': 'https://meta.stackexchange.com/questions/64970/how-can-i-get-the-email-of-a-stackoverflow-user',
  'desc': '18.09.2010 · 1\xa0AntwortYou can\'t. Read your own profile. The e-mail box says "never displayed". The closest we have to private messaging is commenting as a reply\xa0...',
  'email': None},...]
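The selector `div:has(> div > a h3)` used above relies on soupsieve (bundled with bs4) supporting the CSS `:has()` pseudo-class. A minimal, self-contained sketch on hypothetical inline HTML shows what it matches:

```python
from bs4 import BeautifulSoup

# Illustrative HTML only: 'div:has(> div > a h3)' matches a <div> whose
# direct child <div> contains an <a> with an <h3> somewhere inside it,
# mirroring the structure of a Google result block.
html = """
<div id="hit"><div><a href="https://example.com"><h3>Title</h3></a></div></div>
<div id="miss"><div><a href="https://example.com">no heading</a></div></div>
"""
soup = BeautifulSoup(html, 'html.parser')

matches = soup.select('div:has(> div > a h3)')
print([d.get('id') for d in matches])  # only the first div matches
```

Selecting by structure like this survives Google's class-name churn, since it depends only on the tag nesting.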
HedgeHog
  • 22,146
  • 4
  • 14
  • 36
  • Hey, thanks a lot! But when I pasted your code, it throws invalid syntax at `m:= re.search...`. Since I'm new to bs4, could you help me a bit here, please? – bostjan jaro Apr 01 '22 at 07:35
  • Is your Python version up to date? The walrus operator needs Python 3.8+. Otherwise you have to use `'email': re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text).group(0) if re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text) else None` to check, or use `try/except` before appending – HedgeHog Apr 01 '22 at 07:45
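A sketch of the workaround discussed in the comments, using an example string: on Python versions before 3.8 (no walrus operator), search once, store the match object, then branch on it.

```python
import re

# Example input only; any text containing an email address works the same way.
text = 'Contact us at do-not-reply@stackoverflow.email for details.'

# Search once and keep the match object instead of using (m := re.search(...)).
m = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
email = m.group(0) if m else None
print(email)  # do-not-reply@stackoverflow.email
```

This avoids running the same regex twice, which the single-expression fallback in the comment above has to do.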

To scrape Google Search results you can use the BeautifulSoup web-scraping library on its own, without the Selenium webdriver, which also speeds up the script.

To avoid blocks from Google when using requests, you can rotate the user-agent, for example switching between PC, mobile, and tablet, as well as between browsers (Chrome, Firefox, Safari, Edge, and so on). The default user-agent in the requests library is python-requests, so the website can tell that the request comes from a script.
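The rotation idea above can be sketched as picking a random user-agent per request; the strings below are illustrative examples, not an exhaustive or current list:

```python
import random

# Example user-agent strings (desktop, macOS, mobile) -- swap in your own list.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15',
    'Mozilla/5.0 (Linux; Android 12; Pixel 6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Mobile Safari/537.36',
]

def make_headers():
    # Build fresh headers for each request so consecutive requests
    # do not all carry the same user-agent.
    return {'User-Agent': random.choice(USER_AGENTS)}

# Usage with requests (assumed installed):
# requests.get('https://www.google.com/search', headers=make_headers(), params=params)
print(make_headers()['User-Agent'])
```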

To collect the necessary information (email, description, title, phone number, etc.) you can use CSS selectors, which are easy to identify on the page with the SelectorGadget Chrome extension (it does not always work perfectly if the website is rendered via JavaScript).

import requests, re, json, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

params = {
  'q': 'Facebook.com Dantist gmail.com',   # query
  'hl': 'en',                              # language
  'gl': 'us'                               # country of the search, US -> USA
}

html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

data = []

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.find('a')['href']
    snippet = result.select_one('.lyLwlc').text
       
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
    email = '\n'.join(match_email)  # one address per line if several match

    # https://stackoverflow.com/a/3868861/15164646
    match_phone = re.findall(r'((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}))', snippet)
    phone = ''.join(match_phone)
    
    data.append({
        'Title': title,
        'Link': link,
        'Email': email if email else None,
        'Phone': phone if phone else None
    })

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "Title": "Island Dental Associates | Franklin Square NY - Facebook",
    "Link": "https://www.facebook.com/IslandDentalAssociates/",
    "Email": "islanddentalassociatesny@gmail.com",
    "Phone": "(516) 271-0585"
  },
  {
    "Title": "Dental Bright | Houston TX - Facebook",
    "Link": "https://www.facebook.com/DentalBrightHouston/",
    "Email": "Dentalbrighttx@gmail.com",
    "Phone": "(713) 783-6060"
  },
  # ...
]

As an alternative, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it will bypass blocks (including CAPTCHA) from Google, no need to create the parser and maintain it.

Code example:

from serpapi import GoogleSearch
import os, json, re

params = {
   "engine": "google",                         # search engine. Google, Bing, Yahoo, Naver, Baidu...
   "q": "Facebook.com Dantist gmail.com",      # search query
   "api_key": os.getenv('API_KEY')             # your serpapi api key
}
 
search = GoogleSearch(params)                  # where data extraction happens
results = search.get_dict()                    # JSON -> Python dictionary

data = []

for result in results['organic_results']:
   title = result['title']
   link = result['link']
   snippet = result['snippet']

   match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', snippet)
   email = '\n'.join(match_email)

   match_phone = re.findall(r'((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}))', snippet)
   phone = ''.join(match_phone)

   data.append({
     'title': title,
     'link': link,
     'email': email if email else None,
     'phone': phone if phone else None
   })

print(json.dumps(data, indent=2, ensure_ascii=False))

Output:

The output is identical to the output of the bs4 example above.
Denis Skopa