The task is to parse a simple XML document, and analyze the contents by line number.

The right Python package seems to be xml.sax. But how do I use it?

After some digging in the documentation, I found:

  • The xmlreader.Locator interface has the information: getLineNumber().
  • The handler.ContentHandler interface has setDocumentLocator().

The first thought would be to create a Locator, pass it to the ContentHandler, and read the information off the Locator during calls to the handler's characters() method, etc.

BUT xmlreader.Locator is only a skeleton interface: its methods return nothing but placeholder values (-1 for line and column numbers). So as a poor user, WHAT am I to do, short of writing a whole Parser and Locator of my own?
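
A quick check of that skeleton behaviour, as a minimal sketch in Python 3 syntax (the variable names are only illustrative): a bare xmlreader.Locator hands back placeholders, not real positions.

```python
from xml.sax.xmlreader import Locator

# The base class is only an interface; nothing here is tied to a parser.
base = Locator()

# Position methods return -1, identifier methods return None.
placeholders = (
    base.getLineNumber(),
    base.getColumnNumber(),
    base.getPublicId(),
    base.getSystemId(),
)
print(placeholders)
```

So a Locator only becomes useful when a parser supplies an implementation that is tied to its own parsing state.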

I'll answer my own question presently.


(Well I would have, except for the arbitrary, annoying rule that says I can't.)


I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).

The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that? In the source code one finds that it is an ExpatParser, defined in expatreader.py. And it has its own Locator, an ExpatLocator. But there is no access to this thing. Much head-scratching came between this and a solution:

  1. Write your own ContentHandler, which knows about a Locator and uses it to determine line numbers.
  2. Create an ExpatParser with xml.sax.make_parser().
  3. Create an ExpatLocator, passing it the ExpatParser instance.
  4. Make the ContentHandler, giving it this ExpatLocator.
  5. Pass the ContentHandler to the parser's setContentHandler().
  6. Call parse() on the Parser.

For example:

import sys
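import xml.sax
import xml.sax.expatreader  # provides ExpatLocator
import xml.sax.handler      # provides ContentHandler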

class EltHandler( xml.sax.handler.ContentHandler ):
    def __init__( self, locator ):
        xml.sax.handler.ContentHandler.__init__( self )
        self.loc = locator
        self.setDocumentLocator( self.loc )

    def startElement( self, name, attrs ): pass

    def endElement( self, name ): pass

    def characters( self, data ):
        lineNo = self.loc.getLineNumber()
        print >> sys.stdout, "LINE", lineNo, data

def spit_lines( filepath ):
    try:
        parser = xml.sax.make_parser()
        locator = xml.sax.expatreader.ExpatLocator( parser )
        handler = EltHandler( locator )
        parser.setContentHandler( handler )
        parser.parse( filepath )
    except IOError as e:
        print >> sys.stderr, e

if len( sys.argv ) > 1:
    filepath = sys.argv[1]
    spit_lines( filepath )
else:
    print >> sys.stderr, "Try providing a path to an XML file."

Martijn Pieters points out below another approach with some advantages. If the superclass initializer of the ContentHandler is properly called, it defines a private-looking, undocumented member ._locator, which the parser then sets to a proper Locator by calling setDocumentLocator().

Advantage: you don't have to create your own Locator (or find out how to create it). Disadvantage: it's nowhere documented, and using an undocumented private variable is sloppy.

Thanks Martijn!

Steve White

2 Answers

The SAX parser itself is supposed to provide your content handler with a locator. The locator can be any object, as long as it implements the expected methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provided a locator object to your handler, then you can count on those four methods (getColumnNumber(), getLineNumber(), getPublicId() and getSystemId()) being present on the locator.

The parser is only encouraged to set a locator; it is not required to do so. The expat XML parser does provide one.

If you subclass xml.sax.handler.ContentHandler, then it'll provide a standard setDocumentLocator() method for you, and by the time .startDocument() on the handler is called, your content handler instance will have self._locator set:

from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # initialize your handler

    def startElement(self, name, attrs):
        loc = self._locator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))
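
To check the timing claim, here is a minimal sketch in Python 3 syntax. ProbeHandler and its two attributes are hypothetical names used only for illustration, and the sketch relies on the undocumented self._locator attribute discussed in this answer.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ProbeHandler(ContentHandler):
    """Hypothetical handler that records when a locator becomes available."""

    def __init__(self):
        ContentHandler.__init__(self)
        # The base initializer defines the attribute; no parser has set it yet.
        self.before_parse = self._locator
        self.at_start_document = None

    def startDocument(self):
        # By now the parser should have called setDocumentLocator().
        self.at_start_document = self._locator


probe = ProbeHandler()
xml.sax.parseString(b"<root/>", probe)
print(probe.before_parse, probe.at_start_document is not None)
```

The sketch only inspects whether a locator was set. Position queries such as getLineNumber() are best made from inside the handler callbacks, while the parse is still running.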
Martijn Pieters
  • Hi Martijn, Where does this self._locator come from? That is precisely the problem. – Steve White Mar 18 '13 at 13:32
  • @SteveWhite: The `xml.sax.handler.ContentHandler` base class sets `self._locator` when `setDocumentLocator()` is called by the parser. You can also implement your *own* `handler.setDocumentLocator(locator)` method of course, but why have a dog and bark yourself? – Martijn Pieters Mar 18 '13 at 13:35
  • Martijn, this _locator dog is documented where? And how do I get access to this member from the "MyContentHandler" subclass? I try it, and of course it says AttributeError: MyContentHandler instance has no attribute '_locator' – Steve White Mar 18 '13 at 13:51
  • @SteveWhite: The [source code](http://hg.python.org/cpython/file/tip/Lib/xml/sax/handler.py) is the only documentation. What Python version are you running this on? If you subclass `xml.sax.handler.ContentHandler` like I did in my code and make sure that the parent `__init__` method is still invoked if you provide your own, then `self._locator` should be set. – Martijn Pieters Mar 18 '13 at 13:54
  • @SteveWhite: A quick trawl through the mercurial repository shows that the `ContentHandler` base class has had a `self._locator` attribute for ever, so I guess you either do not subclass it or you overrode the `__init__` without calling the parent. :-) – Martijn Pieters Mar 18 '13 at 14:03
  • I am running Python 2.7. I see your point now (since your edit). The superclass constructor defines the _locator, which as you said, is set by the parser. It is plain to me that we are experiencing a point-of-view problem here. For the record: the user should not be expected to read the library source code, to determine how to use the library. And members marked with a leading underscore are intended to be *private*. What is worse? Using an undocumented, private variable in the source code, or using a barely-documented public user interface? I vote for the latter. – Steve White Mar 18 '13 at 14:16
  • @SteveWhite: We are exploring some of the more 'crufty' corners of the standard library here; most people nowadays use the ElementTree API when it comes to parsing XML, but that doesn't give access to line numbers while parsing. The public interface *is* documented, and as I stated you can implement your own `.setDocumentLocator()` method. The `self._locator` attribute is a convenience, and should have been documented properly. `_locator` is private only to the class, a subclass is certainly free to use it. – Martijn Pieters Mar 18 '13 at 14:20
  • what would speak against a `getDocumentLocator()`? – Steve White Mar 18 '13 at 14:22
  • @SteveWhite: Nothing, you'd be perfectly in your rights to implement your own version. The methods of the `ContentHandler` base class are generally expected to be overridden, they are only there to make sure nothing breaks if you do not implement them yourself. – Martijn Pieters Mar 18 '13 at 14:24
  • Martijn, you misunderstood me. I view the documentation and/or interface as broken. I meant, what would speak against a `getDocumentLocator()` in the standard interface (effectively returning the private `._locator`)? This (and the advice to call the superclass constructor) would have solved the problem for me. – Steve White Mar 18 '13 at 14:37
  • Sorry, misunderstood indeed. The interface documents what the *parser* expects there to be. The parser has no need for a `getDocumentLocator()` method. – Martijn Pieters Mar 18 '13 at 14:38
  • Right. Again, it's a point-of-view issue. I'm not a parser writer, I am a guy wanting to analyze an XML document. The interface and documentation _could_ have and _should_ have led quickly to a solution. It failed in this. So the question is, how best to fix it? – Steve White Mar 18 '13 at 14:41
  • @SteveWhite: Stop worrying about it and use my pointers? The documentation about the `setDocumentLocator()` method has been there all along, all I pointed out was that the default base class already implements it and sets `self._locator` if called. – Martijn Pieters Mar 18 '13 at 14:50
  • @SteveWhite: The SAX interfaces are otherwise seen as outdated and archaic these days, and effort goes to the ElementTree API instead, so I don't see anyone going to expend much effort in improving the documentation situation around the SAX module. The module otherwise follows a standard API also implemented in other languages, so perhaps it sets some expectations that you are already familiar with the API. – Martijn Pieters Mar 18 '13 at 14:50

This is an old question, but I think that there is a better answer to it than the one given, so I'm going to add another answer anyway.

While there may indeed be an undocumented private data member named _locator in the ContentHandler superclass, as described in the above answer by Martijn, accessing location information using this data member does not appear to me to be the intended use of the location facilities.

In my opinion, Steve White raises good questions about why this member is not documented. I think the answer to those questions is that it was probably not intended to be for public use. It appears to be a private implementation detail of the ContentHandler superclass. Since it is an undocumented private implementation detail, it could disappear without warning with any future release of the SAX library, so relying on it could be dangerous.

It appears to me, from reading the documentation for the ContentHandler class, and specifically the documentation for ContentHandler.setDocumentLocator, that the designers intended for users to instead override the ContentHandler.setDocumentLocator function so that when the parser calls it, the user's content handler subclass can save a reference to the passed-in locator object (which was created by the SAX parser), and can later use that saved object to get location information. For example:

from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self._mylocator = None
        # initialize your handler

    def setDocumentLocator(self, locator):
        # called by the parser; keep a reference for later use
        self._mylocator = locator

    def startElement(self, name, attrs):
        loc = self._mylocator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))

With this approach, there is no need to rely on undocumented fields.
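
To see the override in action, here is a minimal, self-contained sketch in Python 3 syntax. LineRecorder, its positions list, and the SAMPLE document are hypothetical names invented for this illustration; the sketch assumes the default expat-based parser, which supplies a locator.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LineRecorder(ContentHandler):
    """Hypothetical handler that stores (element name, line number) pairs."""

    def __init__(self):
        super().__init__()
        self._mylocator = None
        self.positions = []

    def setDocumentLocator(self, locator):
        # Called by the parser, if it supplies a locator at all.
        self._mylocator = locator

    def startElement(self, name, attrs):
        loc = self._mylocator
        line = loc.getLineNumber() if loc is not None else None
        self.positions.append((name, line))


# Small made-up document with one element per line.
SAMPLE = b"<root>\n  <item>a</item>\n  <item>b</item>\n</root>"

recorder = LineRecorder()
xml.sax.parseString(SAMPLE, recorder)
for name, line in recorder.positions:
    print(name, line)
```

Collecting the positions in a list instead of printing them inside the handler keeps the line numbers available for analysis after the parse has finished.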

Some Guy