11

I have been trying to parse a file with xml.etree.ElementTree:

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

def analyze(xml):
    it = ET.iterparse(file(xml))
    count = 0
    last = None

    try:        
        for (ev, el) in it:
            count += 1
            last = el

    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))

    print('count: {0}'.format(count))

This is, of course, a simplified version of my code, but it is enough to break my program. If I remove the try/except block, I get this error with some files:

Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    from yparse import analyze; analyze('file.xml')
  File "C:\Python27\yparse.py", line 10, in analyze
    for (ev, el) in it:
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
ParseError: reference to invalid character number: line 1, column 52459

The results are deterministic, though: if a file works, it always works. If a file fails, it always fails, and always at the same point.

The strangest thing is this: I use the traceback to look for malformed XML that might be breaking the parser, and I isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!

This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.

Any ideas?

Aillyn
  • You'll have to show some of the XML in question. It's possible you have bad XML, and then when you edit it to narrow it down, it's becoming good XML. Editors can do sneaky things... – Ned Batchelder Oct 07 '11 at 23:00
  • @NedBatchelder The file is really big, making it very difficult for me to upload it. However, I did consider that possibility. So I used Python's file manipulation functions directly to extract portions of the XML and write them to another file. – Aillyn Oct 07 '11 at 23:26
  • Can you show any of the XML that causes this? – Ned Batchelder Oct 07 '11 at 23:30
  • @pessimopoppotamus: According to your error message the error is happening only 52KB into the file ... – John Machin Oct 07 '11 at 23:47
  • @NedBatchelder I am working on an XML chunker that generates valid XML chunks up to a certain chunk size. I'll try to use that to generate a failing XML and upload it. – Aillyn Oct 08 '11 at 00:24
  • If the problem is at character 52459, then hexdump characters 52450 to 52470, and post those. – Ned Batchelder Oct 08 '11 at 00:40
  • @NedBatchelder There's nothing obvious in there. They are all plain alphabetic ASCII characters. – Aillyn Oct 08 '11 at 00:42
  • FWIW, the count is in bytes, not characters. – John Machin Oct 08 '11 at 00:53
  • @NedBatchelder If you really want to see the files... This is the [whole thing](http://tejp.de/files/so/dbdump/so-export-2009-05-01.7z) (200 MB compressed). Both `comments.xml` and `posts.xml` are giving me errors, but the other files are fine. – Aillyn Oct 08 '11 at 00:53
  • @JohnMachin But `file.seek(pos)` and `file.read(size)` both take arguments in bytes, right? – Aillyn Oct 08 '11 at 00:55
  • @pessimopoppotamus: Yes, args are bytes. Did you open the files in binary mode ("rb")? – John Machin Oct 08 '11 at 01:10
  • I was wrong: The error message gives offsets in terms of Unicode code points. Also, the only `\r\n` is at the very end of the file, so text/binary mode is irrelevant for the sample files. – John Machin Oct 08 '11 at 07:26

4 Answers

9

Here are some ideas:

(0) Explain "a file" and "occasionally": do you really mean it works sometimes and fails sometimes with the same file?

Do the following for each failing file:

(1) Find out what is in the file at the point that it is complaining about:

text = open("the_file.xml", "rb").read()
err_col = 52459
print repr(text[err_col-50:err_col+100]) # should include the error text
print repr(text[:50]) # show the XML declaration

(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/

and edit your question to display your findings.

Update: Here is the minimal xml file that illustrates your problem:

[badcharref.xml]
<a>&#1;</a>

[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
...     print el.tag
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>

Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.

You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);', convert the matched text to an int ordinal and check against the valid list from the spec i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

... or maybe the numeric character reference is syntactically invalid, e.g. not terminated by a ';', or '&#' followed by a non-digit, etc.

Update 2: I was wrong; the number in the ElementTree error message counts Unicode code points, not bytes. See the code below and snippets of the output from running it over the two bad files.

# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough. 

BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not(num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
    or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend

Output:

comments.xml
6615405 &#x10;
10205764 &#x00;
10213901 &#x00;
10213936 &#x00;
10214123 &#x00;
13292514 &#x03;
...
155656543 &#x1B;
155656564 &#x1B;
157344876 &#x10;
157722583 &#x10;

posts.xml
7607143 &#x1F;
12982273 &#x1B;
12982282 &#x1B;
12982292 &#x1B;
12982302 &#x1B;
12982310 &#x1B;
16085949 &#x1C;
16085955 &#x1C;
...
36303479 &#x12;
36303494 &#xFFFF; <<=== whoops
38942863 &#x10;
...
785292911 &#x08;
801282472 &#x13;
848911592 &#x0B;
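
Because the script above reports byte offsets while the ElementTree message counts code points, comparing the two needs a conversion. Here is a minimal sketch, written for Python 3. It assumes a UTF-8 file small enough to hold in memory, lines separated by `\n`, and a 1-based line with a 0-based column (as in the `line 1, column 3` example above). `byte_offset` is a hypothetical helper name, not part of ElementTree.

```python
def byte_offset(raw, line, column, encoding="utf-8"):
    """Convert a (line, column) error position to a byte offset in raw.

    raw is the undecoded file content (bytes); line is 1-based and
    column is a 0-based count of code points (assumptions, see above).
    """
    lines = raw.split(b"\n")
    # Bytes in all earlier lines, plus one newline byte for each of them.
    start = sum(len(earlier) + 1 for earlier in lines[:line - 1])
    # Re-encode the first `column` code points to count their bytes.
    prefix = lines[line - 1].decode(encoding)[:column]
    return start + len(prefix.encode(encoding))


# One two-byte character precedes the bad reference, so the byte
# offset is one larger than the code-point column.
raw = u"<a>\u00e9&#1;</a>".encode("utf-8")
pos = byte_offset(raw, 1, 4)
print(pos, raw[pos:pos + 4])
```

For a file consisting only of ASCII the two numbers coincide; they drift apart by one or more bytes for every non-ASCII character before the error position.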
John Machin
  • (0) Occasionally means "with certain files." The results are deterministic though, if a file works it will always work. If a file fails, it always fails and always fails at the same point. – Aillyn Oct 08 '11 at 00:25
  • (1) I did that, and I couldn't find anything obviously wrong. (2) Can't do it because it's too big. – Aillyn Oct 08 '11 at 00:26
  • I suspected that was the case, but there are no characters like that anywhere near the portion of the file where the error occurs. – Aillyn Oct 08 '11 at 00:36
  • Solutions that involve preprocessing are not a good idea, again because of how big the file is. Ideally, there should be a way for the XML parser to gracefully record the error and go on parsing instead of crashing catastrophically. – Aillyn Oct 08 '11 at 00:38
  • But here's a +1 for the effort. I am done for now. I'll be back later. Hopefully I can verify my chunker works correctly, and I'll upload some samples. – Aillyn Oct 08 '11 at 00:40
  • "couldn't find anything": publish the text around the error position. "it's too big": truncate a copy of the file after 52KB. Ignore all error messages after the first. "crashing": An XML parser is required to reject invalid documents. XML is **NOT** HTML. The parser is not "crashing"; it is doing its job. – John Machin Oct 08 '11 at 00:46
  • This is the [whole thing](http://tejp.de/files/so/dbdump/so-export-2009-05-01.7z) (200 MB compressed). Both `comments.xml` and `posts.xml` are giving me errors, but the other files are fine. – Aillyn Oct 08 '11 at 00:51
  • PLEASE don't make us download 200MB just to help you. Post only the failing files, separately. – John Machin Oct 08 '11 at 00:56
  • LOL that was what I've been trying to tell you. They are ginormous. – Aillyn Oct 08 '11 at 00:56
8

As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.

In fact, all of these entities appear in the text:

set(['&#x08;', '&#x0E;', '&#x1E;', '&#x1C;', '&#x18;', '&#x04;', '&#x0A;', '&#x0C;', '&#x16;', '&#x14;', '&#x06;', '&#x00;', '&#x10;', '&#x02;', '&#x0D;', '&#x1D;', '&#x0F;', '&#x09;', '&#x1B;', '&#x05;', '&#x15;', '&#x01;', '&#x03;'])

Most are not allowed: of these, XML 1.0 permits only &#x09;, &#x0A; and &#x0D;. It looks like this parser is quite strict, so you'll need to find another that is not so strict, or pre-process the XML.
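
Pre-processing does not have to mean rewriting the whole file first. Below is a rough sketch of filtering on the fly, written for Python 3. `CharRefFilter` and `is_valid_xml_char` are hypothetical names introduced here. The sketch assumes an ASCII-compatible encoding such as UTF-8, and it simply drops invalid references, which is one choice among several. It also removes matching text inside CDATA sections and comments, where such text is not actually a reference.

```python
import io
import re
import xml.etree.ElementTree as ET

# Numeric character references, decimal or hexadecimal.
CHAR_REF = re.compile(br"&#(?:([0-9]+)|x([0-9a-fA-F]+));")


def is_valid_xml_char(num):
    """True if num is allowed by the XML 1.0 Char production."""
    return (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF)


def _drop_invalid(match):
    dec, hexa = match.groups()
    num = int(dec) if dec else int(hexa, 16)
    return match.group() if is_valid_xml_char(num) else b""


class CharRefFilter(object):
    """File-like wrapper that removes invalid character references."""

    def __init__(self, raw):
        self._raw = raw      # underlying binary file object
        self._tail = b""     # possibly incomplete reference held back

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            data = self._tail + chunk
            self._tail = b""
            if not chunk:
                # End of input: flush whatever is left.
                return CHAR_REF.sub(_drop_invalid, data)
            # A reference may be split across two chunks, so keep an
            # unterminated trailing "&..." for the next call.
            cut = data.rfind(b"&")
            if cut != -1 and b";" not in data[cut:]:
                data, self._tail = data[:cut], data[cut:]
            out = CHAR_REF.sub(_drop_invalid, data)
            if out:
                return out  # never return b"" before the real end


source = CharRefFilter(io.BytesIO(b"<a>x&#1;y&#x41;z</a>"))
for event, elem in ET.iterparse(source):
    print(elem.tag, repr(elem.text))
```

For a real file, the wrapped object would be something like `open(path, "rb")`. Memory use stays bounded by the chunk size that `iterparse` requests.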

Ned Batchelder
  • Indeed, the files were broken. I am doing some preprocessing prior to parsing it and it works as expected. – Aillyn Oct 08 '11 at 06:37
4

I'm not sure if this answers your question, but if you want to catch the ParseError raised by ElementTree, you would do this:

except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))

Source: http://effbot.org/zone/elementtree-13-intro.htm
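
The caught exception also carries the error location, which can be more useful than a fixed message. Here is a small sketch, written for Python 3. `count_elements` is a hypothetical function name, while `position` and `code` are the attributes ElementTree sets on `ParseError`.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source):
    """Count parsed elements; return (count, error position or None)."""
    count = 0
    try:
        for event, elem in ET.iterparse(source):
            count += 1
    except ET.ParseError as err:
        # position is a (line, column) tuple; code is the expat error code.
        line, column = err.position
        print("parse failed at line %d, column %d (expat code %d)"
              % (line, column, err.code))
        return count, err.position
    return count, None


print(count_elements(io.BytesIO(b"<a><b/>&#1;</a>")))
```

How many elements are delivered before the exception is raised can differ between Python versions, so the count returned on failure is best treated as a lower bound.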

wsisaac
  • This is a very old question with an accepted answer. If you are not sure if you can add anything to the answer, you should refrain from answering for the sake of answering. – Ideasthete Oct 06 '14 at 05:25
0

It may also be worth noting that you can catch the error without stopping your program entirely by reusing the approach you already use later in the function. Place the statement:

it = ET.iterparse(file(xml))

inside a try/except block:

try:
    it = ET.iterparse(file(xml))
except:
    print('iterparse error')

Of course, this will not fix your XML file or pre-processing technique, but could help in identifying which file (if you're parsing lots) is causing your error.
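
Note that the traceback in the question shows the error being raised from `next()`, that is, while iterating and not when `iterparse` is called. A sketch for finding the failing files among many would therefore wrap the loop itself. It is written for Python 3, and `find_unparseable` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def find_unparseable(paths):
    """Return {path: error message} for each file that fails to parse."""
    failures = {}
    for path in paths:
        try:
            with open(path, "rb") as handle:
                # Parsing happens lazily, inside this loop.
                for event, elem in ET.iterparse(handle):
                    elem.clear()  # release element content we do not need
        except (IOError, ET.ParseError) as err:
            failures[path] = str(err)
    return failures
```

Calling it as `find_unparseable(["comments.xml", "posts.xml"])` would return one entry per file that cannot be opened or parsed, along with the parser's message for the first error in that file.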

ntk4