
Goal: Pass a search string to Google, then scrape the URL, the title, and the short description that gets published along with each result.

I have the following code, and at the moment it only gives the first 10 results, which is Google's default limit for one page. I am not sure how to handle pagination while web scraping. Also, when I compare the actual page results with what prints out, there is a discrepancy. Finally, I am not sure of the best way to parse the span elements.

So far I have the span as follows, and I want to remove the <em> element and concatenate the rest of the strings. What would be the best way to do that?

<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>

Code:

from BeautifulSoup import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')

My output looks like this:

http://www.crummy.com/software/BeautifulSoup/
<span class="st"><em>Beautiful Soup</em>: a library designed for screen-scraping HTML and XML.<br /></span>
http://pypi.python.org/pypi/BeautifulSoup/3.2.1
<span class="st"><span class="f">Feb 16, 2012 &ndash; </span>HTML/XML parser for quick-turnaround applications like screen-scraping.<br /></span>
http://www.beautifulsouptheatercollective.org/
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
http://lxml.de/elementsoup.html
<span class="st"><em>BeautifulSoup</em> is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. <em>BeautifulSoup</em> uses a different parsing <b>...</b><br /></span>
https://launchpad.net/beautifulsoup/
<span class="st">The discussion group is at: http://groups.google.com/group/<em>beautifulsoup</em> &middot; Home page <b>...</b> <em>Beautiful Soup</em> 4.0 series is  the current focus of development <b>...</b><br /></span>
http://www.poetry-online.org/carroll_beautiful_soup.htm
<span class="st"><em>Beautiful Soup BEAUTIFUL Soup</em>, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, <em>beautiful Soup</em>!<br /></span>
http://www.youtube.com/watch?v=hDG73IAO5M8
<span class="st"><span class="f">Jul 6, 2009 &ndash; </span>taken from the motion picture &quot;Alice in wonderland&quot; (1999) http://www.imdb.com/<wbr>title/tt0164993/<br /></wbr></span>
http://www.soupsong.com/
<span class="st">A witty and substantive research effort on the history of soup and food in all cultures, with over 400 pages of recipes, quotations, stories, traditions, literary <b>...</b><br /></span>
http://www.facebook.com/beautifulsouptc
<span class="st">To connect with The <em>Beautiful Soup</em> Theater Collective, sign up for Facebook <b>...</b> We&#39;re thrilled to announce the cast of <em>Beautiful Soup&#39;s</em> upcoming production of <b>...</b><br /></span>
http://blog.dispatched.ch/webscraping-with-python-and-beautifulsoup/
<span class="st"><span class="f">Mar 15, 2009 &ndash; </span>Recently my life has been a hype; partly due to my upcoming Python addiction. There&#39;s simply no way around it; so I should better confess it in <b>...</b><br /></span>

Google search page results has the following structure:

<li class="g">
<div class="vsc" sig="bl_" bved="0CAkQkQo" pved="0CAgQkgowBQ">
<h3 class="r">
<div class="vspib" aria-label="Result details" role="button" tabindex="0">
<div class="s">
<div class="f kv">
<div id="poS5" class="esc slp" style="display:none">
<div class="f slp">3 answers&nbsp;-&nbsp;Jan 16, 2009</div>
<span class="st">
I read this without finding the solution:
<b>...</b>
The "normal" way is to: Go to the
<em>Beautiful Soup</em>
web site,
<b>...</b>
Brian beat me too it, but since I already have
<b>...</b>
<br>
</span>
</div>
<div>
</div>
<h3 id="tbpr_6" class="tbpr" style="display:none">
</li>

Each search result is listed under an <li> element.

add-semi-colons

3 Answers


This list comprehension will strip the <em> tags in place:

>>> sSpan
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
>>> [em.replaceWithChildren() for em in sSpan.findAll('em')]
[None]
>>> sSpan
<span class="st">The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
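For anyone on the current bs4 package rather than the legacy BeautifulSoup 3 used above, the equivalent DOM edit is `unwrap()` (a sketch, not part of the original answer):

```python
from bs4 import BeautifulSoup

# unwrap() replaces a tag with its children, just like BS3's
# replaceWithChildren() used above.
soup = BeautifulSoup(
    '<span class="st">The <em>Beautiful Soup</em> Theater Collective</span>',
    'html.parser')
span = soup.find('span', class_='st')
for em in span.find_all('em'):
    em.unwrap()
```

After the loop, `str(span)` contains the same text with the <em> markup gone.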
ChrisGuest
  • Any idea how I can get more than 10 records scraped from the results? – add-semi-colons Jul 17 '12 at 05:35
  • Iterate through the `start` parameter in the URL: `num=10&hl=en&start=0`, `num=10&hl=en&start=10`, `num=10&hl=en&start=20` – ChrisGuest Jul 17 '12 at 05:44
  • Hi Chris, the above solution didn't work, so I edited it. But I see you have removed it. I will add my solution to it. Thanks for looking into it. – add-semi-colons Jul 17 '12 at 17:55
  • NH, if this didn't work for you I'd be happy to see the case that failed. While you can use regular expressions to strip tags in a simple case like this, it is a very bad practice to get into (see link below). Regex approaches rapidly become unworkable with real-world complexity. If you are already using a powerful package like BeautifulSoup to build your DOM, you might as well keep things simple and manipulate the DOM with the same tool too. Note: your original question only asked for the stripping of the tags. If you just want the text content you can do `sSpan.text`. – ChrisGuest Jul 18 '12 at 00:43
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – ChrisGuest Jul 18 '12 at 00:46
  • @Null-Hypothesis - You can get more than 10 results by changing the value of num. Try `num=50` – JRodDynamite Jul 23 '14 at 03:10
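The pagination approach from the comments can be sketched as a URL generator that steps the `start` parameter (a sketch only: actually fetching these URLs still needs a User-Agent header, and Google may block or CAPTCHA automated clients):

```python
import urllib.parse

def google_page_urls(query, pages=3, per_page=10):
    """Build one search URL per result page.

    Google returns `num` results per page, and page N begins at
    start = N * num, so stepping `start` walks through the pages.
    """
    base = "https://www.google.com/search"
    for page in range(pages):
        params = {"q": query, "num": per_page,
                  "hl": "en", "start": page * per_page}
        yield base + "?" + urllib.parse.urlencode(params)

urls = list(google_page_urls("beautifulsoup"))
# urls[0] ends in start=0, urls[1] in start=10, urls[2] in start=20
```

Each URL can then be fetched and parsed exactly like the single page in the question.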

I constructed a simple regular expression for HTML tags, then called replace() on the cleaned-up string to remove the dots:

import re

p = re.compile(r'<.*?>')
print p.sub('',str(sSpan)).replace('.','')

Before

<span class="st">The <em>Beautiful Soup</em> is a collection of all the pretty places you would rather be. All posts are credited via a click through link. For further inspiration of pretty things, <b>...</b><br /></span>

After

The Beautiful Soup is a collection of all the pretty places you would rather be All posts are credited via a click through link For further inspiration of pretty things, 
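As the comments warn, regexes on HTML break down quickly. If a third-party parser is unavailable, Python's standard-library html.parser can strip tags without regexes (a stdlib sketch, not part of the original answer):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only text nodes, discarding every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

stripper = TagStripper()
stripper.feed('<span class="st">The <em>Beautiful Soup</em> is a collection '
              'of all the pretty places you would rather be. <b>...</b><br/></span>')
plain = stripper.text()
```

Unlike the regex, this also handles entities and attributes containing `>` correctly.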
add-semi-colons

To get the text from the span tag you can use the .text / get_text() methods that BeautifulSoup provides. bs4 does all the heavy lifting, so you don't need to worry about how to get rid of the <em> tag.
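For example, applied to the snippet span from the question (using the stdlib html.parser builder; any builder works the same way here):

```python
from bs4 import BeautifulSoup

html = ('<span class="st">The <em>Beautiful Soup</em> Theater Collective was '
        'founded in the summer of 2010 <b>...</b><br/></span>')
soup = BeautifulSoup(html, 'html.parser')
# get_text() flattens every child tag and returns only the text nodes.
snippet = soup.select_one('span.st').get_text()
```

The <em> and <b> tags vanish; only their text content remains in `snippet`.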

Code and full example (Google won't show more than ~400 results):

from bs4 import BeautifulSoup
import requests, lxml, urllib.parse


def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text

    soup = BeautifulSoup(response, 'lxml')

    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.findAll('div', class_='tF2Cxc'):
        head_text = container.find('h3', class_='LC20lb DKV0Md').text
        head_sum = container.find('div', class_='IsZvec').text
        head_link = container.a['href']
        print(head_text)
        print(head_sum)
        print(head_link)
        print()

    return soup.select_one('a#pnnext')


def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=coca cola')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com',
                                             next_page_node['href'])

        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()

Output:

Results via beautifulsoup

Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola

The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.‎Contact Us · ‎Careers · ‎Coca-Cola · ‎Coca-Cola System
https://www.coca-colacompany.com/home

Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/

Together Tastes Better | Coca-Cola®
Coca-Cola is pairing up with celebrity chefs, talented athletes and more surprise guests all summer long to bring you and your loved ones together over the love ...
https://us.coca-cola.com/

Alternatively, you can achieve this using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. Check out the Playground to test it.

Code to integrate:

import os
from serpapi import GoogleSearch

def scrape():
  
  params = {
    "engine": "google",
    "q": "coca cola",
    "api_key": os.getenv("API_KEY"),
  }

  search = GoogleSearch(params)
  results = search.get_dict()

  print(f"Current page: {results['serpapi_pagination']['current']}")

  for result in results["organic_results"]:
      print(f"Title: {result['title']}\nLink: {result['link']}\n")

  while 'next' in results['serpapi_pagination']:
      search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
      results = search.get_dict()

      print(f"Current page: {results['serpapi_pagination']['current']}")

      for result in results["organic_results"]:
          print(f"Title: {result['title']}\nLink: {result['link']}\n")

Output:

Results from SerpApi

Current page: 1
Title: The Coca-Cola Company: Refresh the World. Make a Difference
Link: https://www.coca-colacompany.com/home

Title: Coca-Cola
Link: https://www.coca-cola.com/

Title: Together Tastes Better | Coca-Cola®
Link: https://us.coca-cola.com/

Title: Coca-Cola - Wikipedia
Link: https://en.wikipedia.org/wiki/Coca-Cola

Title: Coca-Cola - Home | Facebook
Link: https://www.facebook.com/Coca-Cola/

Title: The Coca-Cola Company | LinkedIn
Link: https://www.linkedin.com/company/the-coca-cola-company

Title: Coca-Cola UNITED: Home
Link: https://cocacolaunited.com/

Title: World of Coca-Cola: Atlanta Museum & Tourist Attraction
Link: https://www.worldofcoca-cola.com/

Current page: 2
Title: Coca-Cola (@CocaCola) | Twitter
Link: https://twitter.com/cocacola?lang=en

Disclaimer: I work for SerpApi.

Dmitriy Zub