Unable to define regular expression for re.compile and pass it to Beautifulsoup

Question

I am currently practicing the basics of accessing the web with Python. I am following a tutorial on YouTube, which guided me as far as the following code.

from urllib2 import urlopen,  HTTPError
from BeautifulSoup import BeautifulSoup
import re


url="http://getbusinessreviews.org/"
try:
   webpage = urlopen(url).read
except HTTPError, e:  
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
pathFinderTitle = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')
if  webpage:
    if pathFinderTitle:
        findPathTitle = re.findall(pathFinderTitle,webpage)
    else:
        print "unable to get path finder title"

else:
    print "unable to url open "
listIterator =[]
listIterator[:]= range(2,10)

for i in listIterator:
    print findPathTitle[i]

I want to extract "Nutracoster" from the following HTML:

        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>

I've got two questions:

I am getting no results at the moment. Can anyone tell me what I am doing wrong? (I guess my regular expression is not well defined.)
How can I pass this regular expression to Beautiful Soup?

Thanks in advance, and sorry for any silly mistakes, since I am still at the learning stage :D

Answer to your question 3: Yes. `for pathTitle in findPathTitle: ...`. I suggest that you start by learning python basics before diving into complicated stuff like HTML parsing and regexes. — Jasper, Nov 22 '15 at 14:11
Agree with @Jasper, if you want to learn web scraping I would learn beautifulsoup without regex first as it will be easier for you to debug and understand one new concept instead of two. — dstudeba, Nov 22 '15 at 14:55
Thanks for the suggestion, I really appreciate it, but unfortunately the task was assigned to me by my team lead and I have a very short deadline. I need to create a script that scrapes the above-mentioned website and saves its posts in a CSV file. — NightGale, Nov 22 '15 at 15:04
P.S. I've done part 3 myself, and I know a bit of basic Python :) — NightGale, Nov 22 '15 at 15:15
*I need to create the script that would scrap the above mentioned web and save its post in a csv file* It would have helped if you had said this originally as you don't need regex for that. — dstudeba, Nov 22 '15 at 15:59

score 1 · Accepted Answer · edited May 23 '17 at 11:51

You don't need a regex to select an element with Beautiful Soup: it can extract all the <h2> tags with specific attributes by itself.

Further, it's better not to use a regex to parse HTML (see this popular question).
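To address the first question directly, here is a minimal sketch of why the original pattern finds nothing in the HTML shown in the question. It assumes the page source really contains the line breaks visible in that fragment. The `tolerant` pattern is an illustration of mine, not part of the asker's code.

```python
import re

# HTML fragment copied from the question; note the line breaks inside <h2>.
html = """
        <h2 class="entry-title">

            <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>

        </h2>
"""

# Pattern from the question: it requires <a ...> to follow <h2 ...> directly,
# with no whitespace or newline in between.
original = re.compile('<h2 class="entry-title"><a href.* rel="bookmark">(.*)</a></h2>')

# Hypothetical, more tolerant variant: \s* allows whitespace and newlines
# between the tags, and (.*?) is non-greedy so it stops at the first </a>.
tolerant = re.compile(
    r'<h2 class="entry-title">\s*<a href="[^"]*" rel="bookmark">(.*?)</a>\s*</h2>'
)

original_matches = original.findall(html)
tolerant_matches = tolerant.findall(html)

print(original_matches)  # []
print(tolerant_matches)  # ['Nutracoster']
```

Even the tolerant pattern breaks as soon as the attribute order or the quoting changes, which is why a parser is the safer choice. Separately, `urlopen(url).read` in the question is missing its call parentheses, so `webpage` holds a method object rather than the page text.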

Try this little snippet of code:

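from bs4 import BeautifulSoup as BS
from urllib2 import urlopen, HTTPError, URLError

url = "http://getbusinessreviews.org/"
try:
    webpage = urlopen(url)
except HTTPError as e:
    # The server answered, but with an error status such as 404.
    if e.code == 404:
        e.msg = 'data not found on remote: %s' % e.msg
    raise
except URLError as e:
    # The server could not be reached at all. Re-raise, because
    # webpage would otherwise be undefined in the lines below.
    print e.args
    raise

# The 'lxml' parser requires the lxml package to be installed.
soup = BS(webpage, 'lxml')

## Relevant lines ##
# Select every <h2> whose class is "entry-title" and print its text.
for h2 in soup.find_all("h2", attrs={"class": "entry-title"}):
    print h2.text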
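As for the second question: Beautiful Soup accepts a compiled regular expression as an attribute filter in `find_all`, so a regex can still be used for the attribute values while the parser handles the tags. The sketch below works on an inline HTML string instead of the live site. The second `<h2>` block, the pattern `^entry-`, and the variable names are assumptions added for illustration.

```python
import re
from bs4 import BeautifulSoup

# First <h2> is taken from the question; the second is a hypothetical
# non-matching block added only to show that the filter is selective.
html = """
<h2 class="entry-title">
    <a href="http://getbusinessreviews.org/nutracoster/" rel="bookmark">Nutracoster</a>
</h2>
<h2 class="other">
    <a href="http://getbusinessreviews.org/about/">About</a>
</h2>
"""

# 'html.parser' ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, "html.parser")

# A compiled regex used as a filter on the class attribute:
# only <h2> tags with a class starting with "entry-" are returned.
entry_class = re.compile(r"^entry-")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_=entry_class)]

# The same idea applied to the href attribute of <a> tags.
links = [a["href"] for a in soup.find_all("a", href=re.compile(r"/nutracoster/"))]

print(titles)
print(links)
```

Here `get_text(strip=True)` drops the whitespace and line breaks around the link text, which plain `.text` would keep.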

edited May 23 '17 at 11:51 · Community
answered Nov 22 '15 at 15:13 · Giuseppe Ricupero

Thank you very much. I really appreciate your effort. You have saved me from working in the wrong direction. Thanks a lot!!! – NightGale Nov 22 '15 at 20:45
@NightGale: glad to hear that. If you find my answer satisfactory, please accept it, or explain what is missing so that I can expand it. – Giuseppe Ricupero Nov 23 '15 at 11:19
No problem, happy coding! – Giuseppe Ricupero Nov 26 '15 at 10:20
