XML parsing in Python while retaining link to position in original file

Question

I need to extract certain data from XML files, but I also need to know where each extracted element was located in the original XML file, either as a character offset from the beginning of the file or as a line number plus a position within that line.

The commonly used Python XML libraries don't seem to provide any such functionality.

There is a similar question, "Obtaining position info when parsing HTML in Python", that was solved by writing a custom wrapper around html5lib; but that library won't work for me because my data is not HTML.

Are there any XML parsers that keep the element position information, or do I have to roll my own parsing for that?

lxml has `sourceline`, but that only gives you the line number. – gsnedders, Aug 03 '16 at 13:46

score 1 · Accepted Answer · answered Jan 09 '21 at 18:22

The Expat parser has this functionality. Here's a quick and dirty example:

from xml.parsers.expat import ParserCreate, ExpatError, errors

# One module-level parser; an Expat parser object can parse only a single document.
p = ParserCreate()

# Inside a handler, the Current* attributes give the location of the first
# character of the event being reported.
def start_element(name, attrs):
    print(f"Start element at line {p.CurrentLineNumber}, col. {p.CurrentColumnNumber}, byte {p.CurrentByteIndex}: {name}")

def end_element(name):
    print(f"End element at line {p.CurrentLineNumber}, col. {p.CurrentColumnNumber}, byte {p.CurrentByteIndex}:", name)

def char_data(data):
    print(f"Character data at line {p.CurrentLineNumber}, col. {p.CurrentColumnNumber}, byte {p.CurrentByteIndex}:", repr(data))

def parse_xml(xml: str):
    try:
        p.StartElementHandler = start_element
        p.EndElementHandler = end_element
        p.CharacterDataHandler = char_data
        # isfinal=True tells Expat this is the whole document,
        # so truncated input is reported as an error.
        p.Parse(xml, True)
    except ExpatError as err:
        print("Error:", errors.messages[err.code])

parse_xml("<root>abc <tag>ghi</tag>\n def</root>")

and here's the output of this code:

Start element at line 1, col. 0, byte 0: root
Character data at line 1, col. 6, byte 6: 'abc '
Start element at line 1, col. 10, byte 10: tag
Character data at line 1, col. 15, byte 15: 'ghi'
End element at line 1, col. 18, byte 18: tag
Character data at line 1, col. 24, byte 24: '\n'
Character data at line 2, col. 0, byte 25: ' def'
End element at line 2, col. 4, byte 29: root

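As you can see, it can print the line number, column number and byte position of each parse event (start tag, end tag, character data). Note that `CurrentByteIndex` is a byte offset, not a character offset, so the two differ once the input contains non-ASCII characters.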
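To get from printing positions to what the question asks for, the handlers can record positions instead of printing them. The following is only a minimal sketch under stated assumptions: `find_element_offsets` and its `wanted` parameter are hypothetical names introduced for illustration, and the input is assumed to be UTF-8 encoded bytes, so that a byte offset can be converted to a character offset by decoding the prefix.

```python
from xml.parsers.expat import ParserCreate


def find_element_offsets(data: bytes, wanted: str):
    """Return (line, byte_offset, char_offset) for each start tag named `wanted`.

    Hypothetical helper; assumes `data` is UTF-8 encoded XML.
    """
    parser = ParserCreate()
    found = []

    def start(name, attrs):
        if name == wanted:
            byte_offset = parser.CurrentByteIndex
            # Decode everything before the tag to count characters, not bytes.
            char_offset = len(data[:byte_offset].decode("utf-8"))
            found.append((parser.CurrentLineNumber, byte_offset, char_offset))

    parser.StartElementHandler = start
    parser.Parse(data, True)
    return found


doc = "<r>é<x/>\n<x>1</x></r>".encode("utf-8")
print(find_element_offsets(doc, "x"))
```

Decoding the prefix for every match is simple but quadratic in the worst case; for large files it would probably be better to keep a running character count instead.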

score 0 · Answer 2 · answered Aug 03 '16 at 13:45

I don't think such a thing exists. Most parsers first tokenize the text stream and then parse the tokens into a tree. While doing so, they usually know exactly where they are in the original stream (this is required to report parsing errors). However, once the object tree has been built, this information is of little use and is no longer accessible from the resulting objects.

A nice and ugly hack (at the same time!) would be to tokenize the XML input, add "position" attribute(s) referring to the original stream position, parse the XML with a regular library, and use these attribute(s) later for user information...
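One possible variant of this hack, sketched here with the standard library only, skips the separate tokenizing pass: Expat callbacks feed an `xml.etree.ElementTree.TreeBuilder`, and the start handler injects the position as extra attributes. This is an illustration rather than a definitive implementation. The function name `parse_with_positions` and the attribute names `_line`, `_col` and `_byte` are assumptions made up for this example, and the sketch ignores namespaces, comments and processing instructions.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import ParserCreate


def parse_with_positions(data: bytes) -> ET.Element:
    """Build an ElementTree whose elements carry their start-tag position.

    Hypothetical helper; `_line`, `_col` and `_byte` are made-up attribute
    names and could clash with real attributes in the document.
    """
    builder = ET.TreeBuilder()
    parser = ParserCreate()

    def start(name, attrs):
        attrs = dict(attrs)
        # Position of the first character of the start tag.
        attrs["_line"] = str(parser.CurrentLineNumber)
        attrs["_col"] = str(parser.CurrentColumnNumber)
        attrs["_byte"] = str(parser.CurrentByteIndex)
        builder.start(name, attrs)

    parser.StartElementHandler = start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(data, True)
    return builder.close()


root = parse_with_positions(b"<root>abc <tag>ghi</tag>\n def</root>")
tag = root.find("tag")
print(tag.get("_line"), tag.get("_col"), tag.get("_byte"))
```

The resulting tree can be queried with the usual `find`/`iter` calls, and the injected attributes are read back with `get`. They would have to be stripped again before serializing the tree.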

Let us know how you did that!
