3

I started using HTMLParser in Python to extract data from a website. I get everything I want except the text between the opening and closing tags of one HTML element. Here is an example of such an element:

<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>

There are also other tags starting with <a>. They have different attributes and values, so I do not want their data:

<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>

The tag is embedded within a table. I don't know whether that makes it behave differently from other tags. I only want the 'a' tags that have the attribute class="Vocabulary", and I want the data within each of those tags; in the example it would be "Swahili". So what I did is:

class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""


    def handle_starttag(self, tag, attr):
        #print "Encountered a start tag:", tag      
        if tag == 'a':
            for name, value in attr:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
                    #self.lastname = name
                    #self.lastvalue = value
                    print self.lasttag
                    #print self.lastname
                    #print self.lastvalue
                    #return tag
                    print self.countLanguages




    def handle_endtag(self, tag):
        if tag == "a":
            self.inlink = False
            #print "".join(self.data)

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            #self.dataArray.append(data)
            #
            print data

The program prints the data of every <a> tag, but I only want the data of the tags with the right attributes. How do I get this specific data?

goFrendiAsgard
IssnKissn

2 Answers

6

Looks like you forgot to set self.inLink = False in handle_starttag by default:

from HTMLParser import HTMLParser


class AllLanguages(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None

    def handle_starttag(self, tag, attrs):
        self.inLink = False
        if tag == 'a':
            for name, value in attrs:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag

    def handle_endtag(self, tag):
        if tag == "a":
            self.inLink = False  # capital L: "inlink" would only create a new, unused attribute

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            print data


parser = AllLanguages()
parser.feed("""
<html>
<head><title>Test</title></head>
<body>
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="Russian" class="Vocabulary">Russian</a>
</body>
</html>""")

prints:

Swahili
English
Russian
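
On Python 3 the module lives at html.parser and print is a function, so the code above will not run unchanged. Below is a minimal sketch of the same idea for Python 3. The names VocabularyLinks, extract_vocabulary, in_link and names are made up for this example. It assumes the class attribute is exactly "Vocabulary" (no additional classes), and it collects the text into a list instead of printing it.

```python
from html.parser import HTMLParser


class VocabularyLinks(HTMLParser):
    """Collect the text inside every <a class="Vocabulary"> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True only while inside a matching <a>
        self.names = []       # collected link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() simplifies the lookup.
        if tag == "a":
            self.in_link = dict(attrs).get("class") == "Vocabulary"

    def handle_endtag(self, tag):
        # Leaving any <a> ends the region we care about.
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Skip whitespace-only text such as newlines between tags.
        if self.in_link and data.strip():
            self.names.append(data.strip())


def extract_vocabulary(html):
    """Return the text of all matching links found in the html string."""
    parser = VocabularyLinks()
    parser.feed(html)
    parser.close()
    return parser.names


sample = """
<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>
<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>
<a href="http://wold.livingsources.org/vocabulary/2" title="English" class="Vocabulary">English</a>
"""

print(extract_vocabulary(sample))  # ['Swahili', 'English']
```

The flag is only touched when an <a> tag opens or closes, so text inside nested tags such as <b> within a matching link is still collected.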

Also, take a look at:

Hope that helps.

alecxe
  • Thanks a lot. I expected it to be something small ;). I tried BeautifulSoup too and it also works perfectly. Thanks again for your help. – IssnKissn May 28 '13 at 06:23
  • Can you recommend a particular parser? I need the data from the HTML file and want to write it to an XML file. Which one would you use? Or what are the advantages of each parser? – IssnKissn May 28 '13 at 07:42
  • Well, BeautifulSoup and lxml are decent HTML parsers. lxml is known for its speed; BeautifulSoup is pretty handy but doesn't support XPath expressions. See more: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/, http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-xml?rq=1, http://stackoverflow.com/questions/6494199/parsing-html-with-python-2-7-htmlparser-sgmlparser-or-beautiful-soup. – alecxe May 28 '13 at 07:49
  • Well, I have to parse a lot of data, and BeautifulSoup is pretty slow for that. I think I will try lxml. Thanks a lot – IssnKissn May 28 '13 at 12:48
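
For the "write it to an XML file" part of the comments above, one possible sketch uses the standard library's xml.etree.ElementTree and needs no third-party parser. The function name languages_to_xml and the element names "languages" and "language" are assumptions made up for this example. The thread does not say what the target XML should look like.

```python
import xml.etree.ElementTree as ET


def languages_to_xml(names):
    """Build a small XML document from a list of language names."""
    root = ET.Element("languages")  # hypothetical root element name
    for name in names:
        # One child element per extracted name; text is escaped automatically.
        ET.SubElement(root, "language").text = name
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


xml_text = languages_to_xml(["Swahili", "English", "Russian"])
print(xml_text)
```

To write a file rather than a string, ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True) should do it, where path is whatever output file you choose.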
3

You may try HTQL (http://htql.net). The query for:

"the tags called 'a' with the attribute class="Vocabulary" and I want the data within the tag"

is:

<a (class='Vocabulary')>:tx 

The Python code is something like this, where page holds the HTML source as a string:

import htql
a=htql.query(page, "<a (class='Vocabulary')>:tx")
print(a)
seagulf