Python lxml error "namespace not defined."

Question

I am being driven crazy by some oddly formed xml and would be grateful for some pointers:

The documents are defined like this:

<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>

Now, I need to read lots of these files (500m+, all gzip-compressed) and grab the text values from a few of the contained tags.

Sample code:

from lxml import objectify, etree
import gzip

with open('file_list', 'rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)

The first value of elem.text is returned, but only for the first file in the list, and it is quickly followed by this error:

lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20

Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?

Thanks

See this question: http://stackoverflow.com/questions/7018326/lxml-iterparse-in-python-cant-handle-namespaces — pholtz, Mar 18 '16 at 13:36
I think lxml is expecting the namespace to be defined in the document; see e.g. [wikipedia](https://en.wikipedia.org/wiki/XML_namespace). If you have access to where the data is generated, you can add the expected definition. Otherwise, you could strip the namespace away if you don't need it. — syntonym, Mar 18 '16 at 13:40
Thanks syntonym, how could I strip the namespace away? Interestingly (to me at least), if I change sphinx:document to sphinxdocument (which I don't really want to do for the sake of efficiency), it works fine, but I can't run a replace on the gzip.open(filename.gz) output because I get: `xml=gzip.open('00000448069335828601.xml.gz')` `xml.replace('sphinx:document','sphinxdocument')` `AttributeError: 'GzipFile' object has no attribute 'replace'` — RJJ, Mar 18 '16 at 13:52
Is that your entire XML document, or is that snippet from in the middle of one? — Robᵩ, Mar 18 '16 at 13:55
@Rob, It's a snippet - the first few lines and the last line. Each file is about 400 rows of xml, all in the same structure as the first few posted above. thanks — RJJ, Mar 18 '16 at 14:01
Does the following string appear anywhere in your document: `xmlns:sphinx` ? — Robᵩ, Mar 18 '16 at 14:06

Robᵩ · Answer 1 · 2016-03-18T14:19:04.543

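Your input file is not well-formed XML: as the error message says, it uses the `sphinx` namespace prefix without declaring it (a declaration would look like an `xmlns:sphinx="..."` attribute on the element or one of its ancestors). I assume that it is a snippet from a larger XML document.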
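To illustrate the diagnosis, here is a minimal sketch showing that the snippet from the question is rejected on its own but parses once an enclosing element declares the prefix. The wrapper element `root` and the namespace URI `http://example.com/sphinx` are made up for the demonstration; the real URI, if one exists, would have to come from whoever produces the files.

```python
from lxml import etree

# The snippet from the question, as bytes.
SNIPPET = b"""<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>"""

# Hypothetical namespace URI, chosen only for this illustration.
SPHINX_NS = "http://example.com/sphinx"


def parses_cleanly(data):
    """Return True if lxml's default (strict) parser accepts the data."""
    try:
        etree.fromstring(data)
        return True
    except etree.XMLSyntaxError:
        return False


# Wrap the snippet in a made-up root element that declares the prefix.
wrapped = (
    b'<root xmlns:sphinx="' + SPHINX_NS.encode("ascii") + b'">'
    + SNIPPET
    + b"</root>"
)
doc = etree.fromstring(wrapped)[0]

print(parses_cleanly(SNIPPET))       # False: the prefix is undeclared
print(doc.tag)                       # {http://example.com/sphinx}document
print(doc.findtext("page_number"))   # 104
```

Building the wrapped document in memory like this is only practical for small inputs; it is meant to show what the parser is complaining about, not to be a way of processing hundreds of millions of files.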

Your choices are:

Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, pass the recover keyword argument to lxml.etree.iterparse:
```
xml2 = etree.iterparse(in_xml, recover=True)
```
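Put together with the loop from the question, the second option might look like the sketch below. It is written for Python 3 and assumes, as in the question, that `file_list` holds one gzipped file path per line. The function names `page_numbers` and `main` are invented for this example, and the `elem.clear()` call is an addition of mine to keep memory use flat on large inputs, not something the question's code did.

```python
import gzip

from lxml import etree


def page_numbers(path):
    """Yield the text of every <page_number> element in one gzipped file."""
    with gzip.open(path, "rb") as in_xml:
        # recover=True tells the parser to carry on past the undeclared
        # "sphinx" prefix instead of raising XMLSyntaxError.
        for _event, elem in etree.iterparse(in_xml, events=("end",), recover=True):
            if elem.tag == "page_number":
                yield elem.text
            # Drop the element's content once it has been handled.
            elem.clear()


def main(list_path="file_list"):
    """Print each page number followed by the file it came from."""
    with open(list_path) as file_list:
        for line in file_list:
            path = line.strip()
            if not path:
                continue  # skip blank lines in the list
            for number in page_numbers(path):
                print(number, path)
```

Calling `main()` walks every file named in `file_list`. Keep in mind that `recover=True` also hides other, unrelated errors in the input, so it may be worth spot-checking a few files by hand.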
