I am trying to get a web page using the following sample code:

from urllib import urlopen
print urlopen("http://www.php.net/manual/en/function.gettext.php").read()

Now I have the whole web page in a variable. I want to extract the part of the page that looks like this:

<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>

I want to extract the words "string", "gettext" and "$message" so that I can generate a file for use in another application.

Alan Moore

Variations of this question have been asked many times on SO. This is the definitive answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Dave Kirby Sep 25 '10 at 07:53

2 Answers


Why don't you try using BeautifulSoup?

Example code:

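from BeautifulSoup import BeautifulSoup

# htmldoc is the page source, e.g. the string returned by urlopen(...).read()
soup = BeautifulSoup(htmldoc)
# "class" is a reserved word in Python, so pass it in an attrs dict
allSpans = soup.findAll('span', {'class': 'type'})
for element in allSpans:
    ....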
pyfunc

When extracting information from HTML, it isn't recommended to just hack some regexes together. The right way to do it is to use a proper HTML parsing module. Python has several good modules for this purpose - in particular I recommend BeautifulSoup.

Don't be put off by the name - it's a serious module used by a lot of people with great success. The documentation page has a lot of examples that should help you get started with your particular needs.
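
To illustrate the parser-based approach on the snippet from the question, here is a minimal sketch. Note that it swaps BeautifulSoup for the standard library's `html.parser`, so it runs on Python 3 without installing anything. The names `SynopsisParser`, `extract_synopsis`, `WANTED` and `SAMPLE_HTML` are made up for this example, and the markup is copied from the question rather than fetched live, so the real page may need different class names.

```python
from html.parser import HTMLParser

# Sample markup copied from the question; in practice this would be the fetched page.
SAMPLE_HTML = '''<div class="methodsynopsis dc-description">
   <span class="type">string</span><span class="methodname"><b>gettext</b></span> ( <span class="methodparam"><span class="type">string</span> <tt class="parameter">$message</tt></span>
   )</div>'''

# CSS classes whose text we want to collect (assumed from the question's snippet).
WANTED = ("type", "methodname", "parameter")


class SynopsisParser(HTMLParser):
    """Hypothetical helper: collects (class, text) pairs for the wanted classes."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one entry per open tag: (tag name, wanted class or None)
        self.found = []   # (class, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        label = next((c for c in classes if c in WANTED), None)
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        names = [name for name, _ in self.stack]
        if tag in names:
            # Drop the matching tag and anything left open inside it.
            idx = len(names) - 1 - names[::-1].index(tag)
            del self.stack[idx:]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the nearest enclosing wanted element, if any.
        for _, label in reversed(self.stack):
            if label:
                self.found.append((label, text))
                break


def extract_synopsis(html):
    """Return the (class, text) pairs found in the given HTML string."""
    parser = SynopsisParser()
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    for label, text in extract_synopsis(SAMPLE_HTML):
        print(label, text)
    # prints:
    # type string
    # methodname gettext
    # type string
    # parameter $message
```

The stray "(" and ")" are skipped because they sit directly inside the outer div, which has none of the wanted classes. With BeautifulSoup the same lookup is shorter, but the idea is identical: select elements by tag and class instead of matching the raw text with regexes.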

Eli Bendersky