Python regular expression slicing

Question

I am trying to get a web page using the following sample code:

from urllib import urlopen
print urlopen("http://www.php.net/manual/en/function.gettext.php").read()

Now I can get the whole web page in a variable. I want to extract the part of the page that looks like this:

<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>

I want to extract the words "string", "gettext" and "$message" so that I can generate a file for use in another application.

Variations of this question have been asked many times on SO. This is the definitive answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Dave Kirby, Sep 25 '10 at 07:53

score 2 · Answer 1 · answered Sep 25 '10 at 05:47

Why don't you try using BeautifulSoup?

http://www.crummy.com/software/BeautifulSoup/

Example code:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(htmldoc)
# "class" is a reserved word in Python, so pass it in an attrs dict instead
allSpans = soup.findAll('span', {'class': 'type'})
for element in allSpans:
    ....

pyfunc

score 1 · Answer 2 · answered Sep 25 '10 at 05:43

When extracting information from HTML, it isn't recommended to just hack some regexes together. The right way to do it is to use a proper HTML parsing module. Python has several good modules for this purpose - in particular I recommend BeautifulSoup.

Don't be put off by the name - it's a serious module used by a lot of people with great success. The documentation page has a lot of examples that should help you get started with your particular needs.
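To give a feel for what parser-based extraction looks like on the snippet from the question, here is a minimal sketch. Note that it is not BeautifulSoup: it assumes Python 3 and uses the standard library's html.parser so that it runs without installing anything. The names SynopsisParser, extract_synopsis and WANTED are made up for this example, and the code assumes the markup is well nested (a stray unclosed tag such as a bare <br> would throw off the tag stack).

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = '''<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>'''

# CSS classes whose text we want to keep (taken from the sample markup).
WANTED = {"type", "methodname", "parameter"}


class SynopsisParser(HTMLParser):
    """Hypothetical helper: collects (class, text) pairs for wanted tags."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open tag: a wanted class or None
        self.found = []  # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Remember the first wanted class on this tag, if any.
        self.stack.append(next((c for c in classes if c in WANTED), None))

    def handle_endtag(self, tag):
        # Assumes well-nested markup: every end tag closes the last open tag.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Attribute the text to the nearest enclosing wanted tag, so that
        # "gettext" inside <b> is still credited to the methodname span.
        current = next((c for c in reversed(self.stack) if c), None)
        if text and current:
            self.found.append((current, text))


def extract_synopsis(html):
    """Return the (class, text) pairs found in the given HTML string."""
    parser = SynopsisParser()
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    for css_class, text in extract_synopsis(SAMPLE):
        print(css_class, text)
```

With a real page you would pass the downloaded HTML to extract_synopsis instead of SAMPLE. BeautifulSoup does the same job with far less code and copes better with malformed pages, which is why both answers point to it.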
