I am being driven crazy by some oddly formed XML and would be grateful for some pointers.
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots of these files (500m+, all gz compressed) and grab the text values from a few of the contained tags.
Sample code:
from lxml import objectify, etree
import gzip
with open ('file_list','rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)
The first value of elem.text is returned, but only for the first file in the list, and it is quickly followed by this error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance, but XML really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other, more intelligent manner?
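For what it is worth, one way the prefix might be defined is to wrap each decompressed stream in a synthetic root element that declares it, so the parser sees a single well-formed document. The following is only a rough sketch under several assumptions: it uses Python 3 and the standard library's xml.etree.ElementTree instead of lxml, the names SPHINX_NS, WrappedStream, page_numbers and print_page_numbers are made up for illustration, the namespace URI is an arbitrary placeholder (the files never declare one), and the files are assumed not to start with an XML declaration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: any stable string works, since the files declare none.
SPHINX_NS = "http://example.com/sphinx"


class WrappedStream(object):
    """File-like object that yields <root xmlns:sphinx=...>, the stream, then </root>."""

    def __init__(self, stream):
        head = ('<root xmlns:sphinx="%s">' % SPHINX_NS).encode("utf-8")
        self._parts = iter([io.BytesIO(head), stream, io.BytesIO(b"</root>")])
        self._current = next(self._parts)

    def read(self, size=-1):
        # Return data from the current part; move to the next part once it is exhausted.
        while self._current is not None:
            data = self._current.read(size)
            if data:
                return data
            self._current = next(self._parts, None)
        return b""  # every part is exhausted


def page_numbers(stream):
    """Yield the text of each <page_number> element found in a decompressed stream."""
    for _event, elem in ET.iterparse(WrappedStream(stream), events=("end",)):
        if elem.tag == "page_number":
            yield elem.text
        elif elem.tag == "{%s}document" % SPHINX_NS:
            elem.clear()  # drop the finished document's children to limit memory use


def print_page_numbers(list_path):
    """Print every page number next to the file it came from (one path per line)."""
    with open(list_path) as file_list:
        for line in file_list:
            path = line.strip()
            with gzip.open(path, "rb") as in_xml:
                for number in page_numbers(in_xml):
                    print(number + " " + path)
```

Calling print_page_numbers('file_list') would mirror the loop in the sample code above. The wrapper is read in chunks, so a file is never loaded into memory whole, and the synthetic root would also allow several sphinx:document elements in one file, if that is how the files are laid out.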
Thanks