46

I am trying to extract a value from an HTML page using Python's HTMLParser library. The value I want is inside this HTML element:

...
<div id="remository">20</div>
...

This is my HTMLParser class so far:

import HTMLParser
import urllib

class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.seen = {}

  def handle_starttag(self, tag, attributes):
    if tag != 'div': return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        #print value
        return

  def handle_data(self, data):
    print data

p = LinksParser()
f = urllib.urlopen("http://example.com/somepage.html")
html = f.read()
p.feed(html)
p.close()

I want the class to extract the value 20.

Stephen Ostermiller
Martin
  • If you are doing a lot of HTML parsing, try [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). – zvone Jul 18 '10 at 15:58
  • Is that library included as a python std library? I have come across it but chose to stick with HTMLParser. – Martin Jul 18 '10 at 16:33
  • @zvone Why is BeautifulSoup better for html parsing? Is it still a recommended module? Thanks. – tommy.carstensen Mar 28 '16 at 20:11
  • @tommy.carstensen [BeautifulSoup4](http://www.crummy.com/software/BeautifulSoup/) is generally recommended to use for things like web scraping and parsing HTML for specific tags. It has methods for locating specific tags, uses the lxml and html5lib libraries, and handles conversion of incoming documents to Unicode and converts outgoing ones to UTF-8 for you. In short, it does everything you might want to do to an ugly HTML page in just a few short lines. Check out [the bs4 docs](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)! :) – DJGrandpaJ Mar 29 '16 at 13:57
  • @tommy.carstensen I haven't used BeautifulSoup (or parsed HTML) for years. There may be something better out there now. Anyway, what was good about it was that it behaved much better with badly structured HTML. Invalid HTML can be seen more often than valid HTML, so behaving well with it is always a plus. – zvone Mar 29 '16 at 22:20
  • @DJGrandpaJ Thanks for the link to the docs/tutorial. It looks super easy to use. I'll give it a spin next time I need to parse some html. Thanks! – tommy.carstensen Mar 30 '16 at 00:28

4 Answers

69
class LinksParser(HTMLParser.HTMLParser):
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.recording = 0
    self.data = []

  def handle_starttag(self, tag, attributes):
    if tag != 'div':
      return
    if self.recording:
      self.recording += 1
      return
    for name, value in attributes:
      if name == 'id' and value == 'remository':
        break
    else:
      return
    self.recording = 1

  def handle_endtag(self, tag):
    if tag == 'div' and self.recording:
      self.recording -= 1

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)

self.recording counts the number of nested div tags starting from a "triggering" one. When we're in the sub-tree rooted in a triggering tag, we accumulate the data in self.data.

The data at the end of the parse are left in self.data (a list of strings, possibly empty if no triggering tag was met). Your code from outside the class can access the list directly from the instance at the end of the parse, or you can add appropriate accessor methods for the purpose, depending on what exactly is your goal.

The class could be easily made a bit more general by using, in lieu of the constant literal strings seen in the code above, 'div', 'id', and 'remository', instance attributes self.tag, self.attname and self.attvalue, set by __init__ from arguments passed to it -- I avoided that cheap generalization step in the code above to avoid obscuring the core points (keep track of a count of nested tags and accumulate data into a list when the recording state is active).
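
As a rough illustration of that generalization, here is a minimal sketch. It assumes Python 3, where the class lives in html.parser rather than in the HTMLParser module used above. The class name TagTextParser and the text() accessor are hypothetical names chosen for this example; self.tag, self.attname and self.attvalue are the instance attributes described above.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class TagTextParser(HTMLParser):
    """Collect the text inside elements matching <tag attname="attvalue">.

    TagTextParser and text() are hypothetical names for this sketch.
    """

    def __init__(self, tag, attname, attvalue):
        super().__init__()
        self.tag = tag
        self.attname = attname
        self.attvalue = attvalue
        self.recording = 0  # nesting depth inside a triggering tag
        self.data = []      # text chunks seen while recording

    def handle_starttag(self, tag, attributes):
        if tag != self.tag:
            return
        if self.recording:
            # Nested tag of the same kind: go one level deeper.
            self.recording += 1
            return
        # attributes is a list of (name, value) pairs.
        if (self.attname, self.attvalue) in attributes:
            self.recording = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.recording:
            self.recording -= 1

    def handle_data(self, data):
        if self.recording:
            self.data.append(data)

    def text(self):
        """Accessor that joins the recorded chunks into one string."""
        return ''.join(self.data)


parser = TagTextParser('div', 'id', 'remository')
parser.feed('<p>skip</p><div id="remository">20</div><div>other</div>')
parser.close()
print(parser.text())  # prints 20
```

The text() method is one possible form of the accessor mentioned above; reading parser.data directly works just as well.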

Alex Martelli
  • Thanks Alex, that code works perfectly (apart from this line "if tag == div and self.recording:" - div should be a string). What I meant by the class returning a value was actually as you described, a function within the class to return the required value. Or I could easily access the 'data' variable. The dictionary I had in there was just a remnant of me testing possible solutions :) Thanks for your help! – Martin Jul 18 '10 at 15:38
  • +1 for the count of nested `div`s, which is not so obvious for those approaching HTML parsing for the first time. – mg. Jul 18 '10 at 15:49
  • @Martin, you're welcome, and +1 for spotting my distraction -- I'll edit now to fix (quote `div` and remove that dict & comment) for more usefulness to future readers. – Alex Martelli Jul 18 '10 at 16:22
  • What if the data is Unicode, for example Japanese or Chinese? How can I append it to the data[] list? – おおさま Jun 07 '13 at 10:45
31

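Have you tried BeautifulSoup?

from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup('<div id="remository">20</div>', 'html.parser')
tag = soup.div  # the first <div> in the document
print(tag.string)

This prints 20.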

modzello86
6

A small correction to line 3:

HTMLParser.HTMLParser.__init__(self)

If you import the class with `from HTMLParser import HTMLParser`, as in the code below, it should be

HTMLParser.__init__(self)

The following worked for me:

import urllib2

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

  def __init__(self):
    HTMLParser.__init__(self)
    self.recording = 0
    self.data = []

  def handle_starttag(self, tag, attrs):
    if tag == 'required_tag':
      for name, value in attrs:
        if name == 'somename' and value == 'somevalue':
          print name, value
          print "Encountered the beginning of a %s tag" % tag
          self.recording = 1

  def handle_endtag(self, tag):
    # Only decrement while recording; a negative counter would be truthy
    # and make handle_data record unrelated text.
    if tag == 'required_tag' and self.recording:
      self.recording -= 1
      print "Encountered the end of a %s tag" % tag

  def handle_data(self, data):
    if self.recording:
      self.data.append(data)

p = MyHTMLParser()
f = urllib2.urlopen('http://www.example.com')
html = f.read()
p.feed(html)
p.close()  # flush any buffered data before reading the result
print(p.data)
Guillaume Jacquenot
pshirishreddy
  • actually you're able to do that because you specified `from HTMLParser import HTMLParser`, which allows you to directly call HTMLParser. It's unfortunate that they both have the same name, but they're two different entities. You could also do something like `from HTMLParser import HTMLParser as parser` and then just use `class MyHTMLParser(parser)` – Nona Urbiz Jan 24 '11 at 23:22
-2

Assuming soup is a BeautifulSoup object built from the page, this works:

print (soup.find('the tag').text)
Undo
helu