
I am being driven crazy by some oddly formed XML and would be grateful for some pointers.

The documents are defined like this:

<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>

Now, I need to read lots of these files (500m+, all gzip-compressed) and grab the text values from a few of the contained tags.

sample code:

from lxml import etree
import gzip

# file_list holds one .gz path per line
with open('file_list') as file_list:
    for line in file_list:
        path = line.strip()
        with gzip.open(path) as in_xml:
            xml2 = etree.iterparse(in_xml)
            for action, elem in xml2:
                if elem.tag == "page_number":
                    print(elem.text + " " + path)

The first elem.text value is printed, but only for the first file in the list, and it is quickly followed by this error:

lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20

Please excuse my ignorance, but XML really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other, more intelligent manner?

Thanks

RJJ
  • See this question: http://stackoverflow.com/questions/7018326/lxml-iterparse-in-python-cant-handle-namespaces – pholtz Mar 18 '16 at 13:36
  • I think lxml is expecting that the namespace is defined in the document see e.g. [wikipedia](https://en.wikipedia.org/wiki/XML_namespace). If you have access to where the data is generated you can add the expected definition. Else you could strip the namespace away if you don't need it. – syntonym Mar 18 '16 at 13:40
  • Thanks syntonym, how could I strip the namespace away? Interestingly (to me at least), if I change sphinx:document to sphinxdocument (which I don't really want to do for the sake of efficiency), it works fine, but I can't run a replace on the gzip.open(filename.gz) output. Running `xml=gzip.open('00000448069335828601.xml.gz')` followed by `xml.replace('sphinx:document','sphinxdocument')` gives: AttributeError: 'GzipFile' object has no attribute 'replace' – RJJ Mar 18 '16 at 13:52
  • Is that your entire XML document, or is that snippet from in the middle of one? – Robᵩ Mar 18 '16 at 13:55
  • @Rob, It's a snippet - the first few lines and the last line. Each file is about 400 rows of xml, all in the same structure as the first few posted above. thanks – RJJ Mar 18 '16 at 14:01
  • Does the following string appear anywhere in your document: `xmlns:sphinx` ? – Robᵩ Mar 18 '16 at 14:06
  • No, no mention of xmlns:sphinx Rob – RJJ Mar 18 '16 at 14:07

1 Answer


Your input file is not well-formed XML: the sphinx prefix is used but never declared. I assume that it is a snippet from a larger XML document.

Your choices are:

  • Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.

  • Parse the file in spite of its errors. To do that, pass the recover keyword argument to lxml.etree.iterparse:

    xml2 = etree.iterparse(in_xml, recover=True)
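
The first option above (reconstructing the larger document) can also be approximated without touching the files. The following is only a sketch, not something from the original files or their producer: it feeds each decompressed stream to the standard library's pull parser between a synthetic opening and closing root tag that declares the sphinx prefix. The `root` wrapper element, the namespace URI `urn:example:sphinx`, and the helper names `iter_tag_text` and `iter_gz_file` are all made up for illustration. It assumes each file contains only sibling sphinx:document elements and no XML declaration of its own.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the files never declare one, so this is a
# placeholder chosen only so that the "sphinx" prefix resolves.
SPHINX_NS = "urn:example:sphinx"


def iter_tag_text(stream, wanted="page_number", chunk_size=64 * 1024):
    """Yield the text of every <wanted> element read from a binary stream."""
    parser = ET.XMLPullParser(events=("end",))
    record_tag = "{%s}document" % SPHINX_NS

    def drain():
        # Hand out whatever elements the parser has finished so far.
        for _event, elem in parser.read_events():
            if elem.tag == wanted:
                yield elem.text
            elif elem.tag == record_tag:
                elem.clear()  # drop the children of a finished record

    # Synthetic root that declares the otherwise undefined prefix.
    parser.feed(b'<root xmlns:sphinx="' + SPHINX_NS.encode("ascii") + b'">')
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        parser.feed(chunk)
        yield from drain()
    parser.feed(b"</root>")
    parser.close()
    yield from drain()  # pick up anything the parser was still holding


def iter_gz_file(path, wanted="page_number"):
    """Same as iter_tag_text, but opens a gzip-compressed file by path."""
    with gzip.open(path, "rb") as stream:
        yield from iter_tag_text(stream, wanted)
```

In the loop from the question, the body would then become something like `for text in iter_gz_file(path): print(text, path)`. Whether this is faster or slower than `recover=True` on 500m+ files is not something this sketch establishes, so it would need measuring on real data.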
    
Robᵩ