2

So, I assume this is a pretty typical use case, but I can't find anything about support for it in the lxml documentation. Basically, I've got an XML file that consists of a number of distinct XML documents (reviews, in particular). The structure is approximately:

<review>
    <!-- A bunch of metadata -->
</review>
<!-- The issue is here -->
<review>
    <!-- A bunch of metadata -->
</review>

Basically, I try to read the file in like so:

import lxml.etree

with open(xml_file) as f:
    document = lxml.etree.fromstring(f.read())

But I get an error when I do so:

lxml.etree.XMLSyntaxError: Extra content at the end of the document

Totally reasonable error, in fact it is an xml error and should be treated as such, but my question is: how do I get lxml to recognize that this is a list of xml documents and to parse accordingly?

list_of_reviews = lxml.magic(open(xml_file).read())

Is magic a real lxml function?

Slater Victoroff

2 Answers

1

So, it's a little hacky, but should be relatively robust. There are two main negatives here:

  • Repeated calls to fromstring mean that this code isn't extremely fast. It is about the same speed as parsing each document individually, and much slower than if it were all one document.
  • Error positions are reported relative to the chunk currently being parsed, not to the original file. It would be easy to add support for absolute positions (just add an accumulator that keeps track of the current location).

Basically, the approach is to catch the error that is thrown and then parse just the section of the file above the error location. If the error isn't related to the end of a root node, it is handled like a typical exception.

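import lxml.etree
from lxml.etree import XMLSyntaxError

def fix_xml_list(test_file):
    """Split a string of concatenated XML documents into parsed root elements."""
    documents = []
    while test_file.strip():
        try:
            # If the remainder parses cleanly, it is the last document.
            documents.append(lxml.etree.fromstring(test_file))
            break
        except XMLSyntaxError as e:
            # Code 5 is "Extra content at the end of the document"; column 1
            # means the next root element starts at the beginning of a line.
            if e.code == 5 and e.position[1] == 1:
                doc_end = e.position[0]
                # Index of the newline that ends the line before the error.
                end_char = find_nth(test_file, '\n', doc_end - 2)
                documents.append(lxml.etree.fromstring(test_file[:end_char]))
                test_file = test_file[end_char:]
            else:
                print(e)
                break
    return documents

def find_nth(doc, search, n=0):
    """Return the index of the (n + 1)-th occurrence of search in doc, or -1."""
    l = len(search)
    i = -l
    for _ in range(n + 1):
        i = doc.find(search, i + l)
        if i < 0:
            break
    return i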

The find_nth code is shamelessly stolen from this question. It's possible that there aren't many situations where this code is deeply useful, but for me with a large number of slightly irregular documents (very common with academic data) it's invaluable.

Slater Victoroff
-1

XML documents must have a single root element; otherwise, they are not well-formed, and are, in fact, not XML. Conformant parsers cannot parse non-well-formed "XML".

When you construct your single XML document out of multiple documents, simply wrap the disparate root elements in a single root element. Then you'll be able to use standard parsers such as lxml.
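A minimal sketch of the wrapping approach, assuming none of the concatenated documents carries its own XML declaration or DOCTYPE (those would not be legal inside a wrapper element). The names `parse_wrapped` and `root_tag` are made up for illustration, and the fallback import is only there so the snippet also runs where lxml isn't installed:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library module is API-compatible for this use.
    import xml.etree.ElementTree as etree


def parse_wrapped(text, root_tag="reviews"):
    """Wrap concatenated XML documents in one synthetic root and parse them.

    Returns the original top-level elements; comments between them are skipped.
    """
    wrapped = "<{0}>{1}</{0}>".format(root_tag, text)
    root = etree.fromstring(wrapped)
    # Comment nodes (which lxml keeps) have a non-string tag, so filter on that.
    return [child for child in root if isinstance(child.tag, str)]


sample = """<review>
    <title>first</title>
</review>
<!-- a comment between documents -->
<review>
    <title>second</title>
</review>
"""

reviews = parse_wrapped(sample)
print([r.findtext("title") for r in reviews])  # prints ['first', 'second']
```

Because everything goes through a single parse, any line numbers in error messages refer to the whole file rather than to an individual document.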

kjhughes
  • `Totally reasonable error, in fact it is an xml error and should be treated as such` I'm aware that this isn't correct, but I am not creating the `xml` file. It's coming from an academic mirror that I have no control over and I would rather not have to manually add a root element if possible. – Slater Victoroff Jul 13 '15 at 20:03
  • What you have is **not XML** unless it has a single root element. It doesn't matter if it's outside of your control or if you know it's wrong or if you have the best of intentions. ***It's not XML, and you cannot expect conformant XML parsers to help you until you first repair it***, manually or programmatically. Sorry if this is not what you want to hear, but it's the way it is. Just programmatically wrap a single root around them and parse them as a single XML document, or parse them as separate XML documents then combine the results programmatically after the separate parses. – kjhughes Jul 13 '15 at 20:22
  • `parse them as separate XML document, then combine results programmatically`: This is what I want, but there doesn't seem to be a straightforward way to do it – Slater Victoroff Jul 13 '15 at 20:41
  • Straightforward is different than standard. There's no standard way because you're not operating within the standard, but I've already offered two straightforward ways: (1) wrap the separate XML trees in a single root element or (2) parse the document separately, extract what you need, and combine the results per your particular needs. There's nothing else to say at this level. Good luck. – kjhughes Jul 13 '15 at 20:59
  • I'm sorry, I don't mean to be bothersome, but as far as 2 is concerned I'm unclear as to what you mean by "parse the document separately". Do you mean split the documents beforehand? Or is there a way to do this automatically? – Slater Victoroff Jul 13 '15 at 21:11
  • Yes, if each of your `<review>` elements are alone well-formed XML, then you can invoke the parser on them individually (as separate files or strings) and gather whatever app-specific data you need into whatever app-specific data structures you like. – kjhughes Jul 13 '15 at 21:40
  • Ah, sorry, that won't really work for me. `<review>` is just an example of a tag, and only ~60% of the documents are actually documents. I found a workaround that will work excellently for my situation. Without deeper integration into lxml it won't be performant, but it's fast enough for my use case. Thanks for trying to help. – Slater Victoroff Jul 13 '15 at 21:55
  • This doesn't help solve the issue; you can't assume OP can change the way her data is generated. – Bananin Sep 06 '20 at 19:43
  • @Bananin: Your comment on a five year old answer is unconstructive, but if you need similar help, feel free to post a new question, and I'll try to help you there with your particular case. Thanks. – kjhughes Sep 07 '20 at 02:50