Text between tag using SAX parser in Python

Question

I want to print the text between a particular tag in an XML file using SAX.

However, some of the text output consist of spaces or a newline character.

Is there a way to just pick out the actual strings? What am I doing wrong?

See code extract and XML document below.

(I get the same effect with both Python 2 and Python 3.)

#!/usr/bin/env python3

import xml.sax

class MyHandler(xml.sax.ContentHandler):

        def startElement(self, name, attrs):
                self.tag = name

        def characters(self, content):
                if self.tag == "artist":
                        print('[%s]' % content)

if __name__=='__main__':
        parser=xml.sax.make_parser()
        Handler=MyHandler()
        parser.setContentHandler(Handler) #overriding default ContextHandler
        parser.parse("songs.xml")

<?xml version="1.0"?>
<genre catalogue="Pop">
  <song title="No Tears Left to Cry">
    <artist>Ariana Grande</artist>
    <year>2018</year>
    <album>Sweetener</album>
  </song>
  <song title="Delicate">
    <artist>Taylor Swift</artist>
    <year>2018</year>
    <album>Reputation</album>
  </song>
  <song title="Mrs. Potato Head">
    <artist>Melanie Martinez</artist>
    <year>2015</year>
    <album>Cry Baby</album>
  </song>
</genre>

ok clarified the wording, edited to include full minimal example code and added XML document. — NetHead, Oct 29 '21 at 20:41
Another issue is pylint complains that tag is defined outside __init__ (code W0201), However, if I add an __init__ method to the class, pylint complains that this method is not called from the base class (code W0231), — NetHead, Nov 03 '21 at 00:35

score 0 · Answer 1 · answered Oct 30 '21 at 11:33

If you want to use SAX then you need a solid understanding of the XML specification. The technical name for the white space is 'mixed content'. It occurs before the first child tag, between child tags and after the final child tag. Most XML processors will report SAX events for mixed content. Some have a flag for suppressing it (because many applications are only interested in text-only content or element-only content).

Solutions include:

a) Stop using SAX. DOM would be a lot more straightforward

b) Add code to detect the startElement and endElement events for the tag(s) that you're interested in. Ignore events unless you're inside one of your 'interesting' tags.

c) use XSLT to turn your XML document into whatever form you require (see How to transform an XML file using XSLT in Python?)

My choice would always be c) because XSLT is a superpower, and it makes this type of task very simple.

Hi @kimbert, Thanks for your answer. Assuming use of SAX, is there a specific fix for this issue, other than something like amending the code to ```if self.tag == "parameterName" and content[0] != '\n' and content[0] != ' ': names.append(content)``` ? — NetHead, Nov 01 '21 at 15:29
There is always more than one way to do it :-). If you only care about one tag name, your solution will work. But I don't think you need to test for the '\n' and ' ' unless you want to suppress leading whitespace in the first character of the 'parameterName' tag. — kimbert, Nov 02 '21 at 13:25

score 0 · Accepted Answer · answered Nov 02 '21 at 16:38

The value of self.tag is set to "artist" when the <artist> start tag is encountered, and it does not change until startElement() is called for the <year> start tag. Between those elements is some uninteresting whitespace for which SAX events are also reported by the parser.

One way to get around this is to add an endElement() method to MyHandler that sets self.tag to something else.

def endElement(self, name):
    self.tag = "whatever"

Text between tag using SAX parser in Python

2 Answers2