I have a pretty big XML file and I need to get all the nodes (different companies information) that contain a specific parameter. XML is about 12 GB unpacked.
<Companies xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ...>
<Company id="782634892" source="abcd">
<attribution>abcde</attribution>
<name xml:lang="en">company name</name>
<Phones>
<Phone type="phone" hide="0">
<formatted>+1800111</formatted>
<country>1</country>
<prefix>800</prefix>
<number>111</number>
</Phone>
</Phones>
<Rubrics>
<rubric ref="184107947"/>
</Rubrics>
There is a bunch of stuff more but that doesn't matter.
My code is pretty simple:
file = open('companies2.xml')
data = file.read()
dom = parseString(data)
key = dom.getElementsByTagName("Company")
for elements in key:
rubricsArray = elements.getElementsByTagName("Rubrics")[0].getElementsByTagName("rubric")
for rub in rubricsArray:
if rub.attributes["ref"].value == '32432793389':
print elements.toxml()
It works on a smaller file I made for testing. But here it doesn't.
Traceback (most recent call last):
File "./XMLparse.py", line 29, in <module>
dom = parseString(data)
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1930, in parseString
return expatbuilder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
OverflowError: size does not fit in an int
Any ideas how to make it work? I tried to use gz file but zmore creates some random first line:
------> companies2.xml.gz <------
And DOM won't parse it. So i gunzipped it. Thanks in advance for any help.