
I am currently practicing the basics of accessing the web with Python. I am following a tutorial on YouTube, which guided me as far as the following code.

from urllib2 import urlopen,  HTTPError
from BeautifulSoup import BeautifulSoup
import re


url="http://getbusinessreviews.org/"
try:
   webpage = urlopen(url).read
except HTTPError, e:  
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
pathFinderTitle = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')
if  webpage:
    if pathFinderTitle:
        findPathTitle = re.findall(pathFinderTitle,webpage)
    else:
        print "unable to get path finder title"

else:
    print "unable to url open "
listIterator =[]
listIterator[:]= range(2,10)

for i in listIterator:
    print findPathTitle[i]

I want to extract "Nutracoster" from the following HTML:

        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>

I have two questions:

  1. I am getting no results at the moment. Can anyone tell me what I am doing wrong? (I guess my regular expression is not well defined.)

  2. How can I pass this regular expression to BeautifulSoup?

Thanks in advance, and sorry for any silly mistakes, since I am still at the learning stage :D

NightGale
  • Answer to your question 3: Yes. `for pathTitle in findPathTitle: ...`. I suggest that you start by learning Python basics before diving into complicated stuff like HTML parsing and regexes. – Jasper Nov 22 '15 at 14:11
  • Agree with @Jasper, if you want to learn web scraping I would learn beautifulsoup without regex first as it will be easier for you to debug and understand one new concept instead of two. – dstudeba Nov 22 '15 at 14:55
  • Thanks for the suggestion, I really appreciate it, but unfortunately the task was assigned to me by my team lead and I have a very short deadline. I need to create a script that scrapes the above-mentioned website and saves its posts in a CSV file. – NightGale Nov 22 '15 at 15:04
  • P.S. I've done part 3 myself, and I know a bit of basic Python :) – NightGale Nov 22 '15 at 15:15
  • *I need to create a script that scrapes the above-mentioned website and saves its posts in a CSV file* It would have helped if you had said this originally, as you don't need regex for that. – dstudeba Nov 22 '15 at 15:59

1 Answer


You don't need a regex to select an element with Beautiful Soup: it can extract all the <h2> tags with specific attributes by itself.
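On the second question (passing a regular expression to Beautiful Soup), here is a minimal sketch, assuming Python 3 with the `bs4` package installed and using the HTML fragment from the question as inline input rather than the live page. `find_all` accepts a compiled pattern as an attribute filter, so a regex can be handed over without running it against the raw markup. The `href` pattern below is only an illustration.

```python
import re
from bs4 import BeautifulSoup

# Markup copied from the question; the live page may look different.
html = """
<h2 class="entry-title">
    <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>
</h2>
"""

# "html.parser" ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, "html.parser")

# Plain selection by tag name and class: no regex involved.
titles = [h2.get_text(strip=True)
          for h2 in soup.find_all("h2", class_="entry-title")]

# A compiled pattern used as an attribute filter (illustrative pattern):
# only <a> tags whose href matches the regex are returned.
links = soup.find_all("a", href=re.compile(r"getbusinessreviews\.org/[^/]+/$"))
link_texts = [a.get_text(strip=True) for a in links]

print(titles)
print(link_texts)
```

Both lists should contain the single title from the fragment. The regex only filters on an attribute value; the HTML structure itself is still handled by the parser.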

Further, it's better not to use a regex to parse HTML (see this popular question).
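To illustrate how fragile the regex approach is, here is a small sketch in Python 3 that runs the pattern from the question against the HTML fragment from the question. Two things in the original code likely contribute to the empty result: `urlopen(url).read` is missing its call parentheses, so `webpage` is a method object rather than the page text, and the pattern expects `<h2 ...>` and `<a ...>` to sit directly next to each other, while the real markup has line breaks between them. The `loose` pattern is a hypothetical variant written for this example.

```python
import re

# Fragment copied from the question, including its line breaks.
html = """
        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>
"""

# Pattern from the question: requires <a ...> immediately after <h2 ...>.
strict = re.compile(
    r'<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>'
)

# Hypothetical variant: tolerate whitespace (including newlines) between
# the tags and use a non-greedy capture for the link text.
loose = re.compile(
    r'<h2 class="entry-title">\s*<a href="[^"]*" rel="bookmark">(.*?)</a>\s*</h2>'
)

strict_hits = strict.findall(html)
loose_hits = loose.findall(html)

print(strict_hits)  # no match: whitespace separates the tags
print(loose_hits)   # the link text is captured
```

Even the looser pattern breaks as soon as the attribute order or quoting changes, which is why an HTML parser is the safer choice.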

Try this little snippet of code:

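# Python 2 code (urllib2, print statement)
from bs4 import BeautifulSoup as BS
from urllib2 import urlopen, HTTPError, URLError

url = "http://getbusinessreviews.org/"
try:
    webpage = urlopen(url)
except HTTPError, e:
    # HTTPError is a subclass of URLError, so it has to be caught first.
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
except URLError, e:
    # Network-level failure: report it and re-raise, because webpage
    # is undefined and the parsing below cannot run.
    print e.args
    raise

# The 'lxml' parser requires the lxml package to be installed.
soup = BS(webpage, 'lxml')

## Relevant lines ##
# Select every <h2 class="entry-title"> and print its text content.
for h2 in soup.find_all("h2", attrs={"class": "entry-title"}):
    print h2.text.strip()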
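
Since the comments mention saving the posts to a CSV file, the titles printed by the loop above could be collected and written out with the standard `csv` module. This is a minimal sketch in Python 3, not part of the original answer: `write_titles` is a hypothetical helper, and `sample_titles` is placeholder input standing in for the text of each `<h2>` found by `find_all`.

```python
import csv
import io


def write_titles(titles, out):
    """Write a header row and one row per title to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["title"])
    for title in titles:
        # Strip the whitespace that surrounds the link text in the markup.
        writer.writerow([title.strip()])


# Placeholder input; in practice this would be the h2 texts from the page.
sample_titles = ["  Nutracoster\n"]

# An in-memory buffer keeps the example free of file-system side effects.
buffer = io.StringIO()
write_titles(sample_titles, buffer)
print(buffer.getvalue())
```

To write a real file instead, the same helper can be given a handle opened with `open("posts.csv", "w", newline="")`, where `posts.csv` is just an example name.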
Giuseppe Ricupero