
I am trying to read an XML file in Python using xml.etree, but for some files I get a MemoryError while parsing. The XML file is 912 MB. Is the issue related to the file size?

Code:

from xml.etree import ElementTree
with open('F:\\Reports\\Logs\\AppPerfect_States\\TG1_GM\\Result_TG1_V16.xml', 'rt') as f1:
    tree = ElementTree.parse(f1)

Error:

Traceback (most recent call last):
  File "<pyshell#3>", line 2, in <module>
    tree = ElementTree.parse(f1)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 653, in parse
    data = source.read(65536)
MemoryError

Update: As several people suggested, I tried lxml.

Code:

from lxml import etree
context = etree.iterparse('F:\\Reports\\Logs\\AppPerfect_States\\TG1_GM\\Result_TG1_V16.xml', tag="document")
for event, element in context:
    for child in element:
        print child.tag, child.text
    element.clear()

Error:

C:\Python27\python.exe "F:/Py Projects/V16_AUTO/test1/xmlparsingtest1.py"
Traceback (most recent call last):
  File "F:/Py Projects/V16_AUTO/test1/xmlparsingtest1.py", line 3, in <module>
    for event, element in context:
  File "iterparse.pxi", line 207, in lxml.etree.iterparse.__next__ (src\lxml\lxml.etree.c:126137)
lxml.etree.XMLSyntaxError: unknown error, line 7530730, column 33

Update 2: I tried cElementTree.

Code:

import xml.etree.cElementTree as etree
xmL = 'F:\\Reports\\Logs\\Result_TG1_V16.xml'
context = etree.iterparse(xmL, events=("start", "end"))
context = iter(context)
event, root = context.next()
for event, elem in context:
    if event == 'TasksReportNode':
        print elem.tag
        print elem.text
        root.clear()

Error:

Exception MemoryError:  in  ignored
Exception MemoryError:  in  ignored
Exception MemoryError:  in  ignored
Exception MemoryError:  in  ignored
Exception MemoryError:  in  ignored
MemoryError
siddhu619
  • As the documentation at https://docs.python.org/2/library/xml.etree.elementtree.html suggests, you don't have to open the XML file with `open`. Just do: `import xml.etree.ElementTree as ET; tree = ET.parse('F:\\Reports\\Logs\\AppPerfect_States\\TG1_GM\\Result_TG1_V16.xml'); root = tree.getroot()` – Alexey Smirnov Mar 15 '16 at 07:32
  • Does the error always happen on the same file? Have you checked that the file is good xml? – DisappointedByUnaccountableMod Mar 15 '16 at 08:57
  • @AlexeySmirnov I tried your suggestion and got the same error. – siddhu619 Mar 15 '16 at 09:08
  • @barny I have checked the file and the XML is good; I think the error is caused by the size of the file. – siddhu619 Mar 15 '16 at 09:09
  • @siddhu619 If it's a memory issue, you may want to consult this question and its answer: http://stackoverflow.com/questions/7697710/python-running-out-of-memory-parsing-xml-using-celementtree-iterparse – Alexey Smirnov Mar 15 '16 at 09:11
  • @Alexey I have 8 GB of RAM, and 4 GB was available before I ran the code. When execution started, memory usage spiked to 6 GB, leaving 2 GB free, so I think memory is not the issue. – siddhu619 Mar 15 '16 at 09:39
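
One thing that may be worth ruling out, given that RAM was still free when the error appeared: a 32-bit Python build can only address a few gigabytes per process no matter how much RAM is installed, so a MemoryError can be raised while the machine still has free memory. Whether that applies here is an assumption, since the post does not say which build is in use. A minimal sketch of a check (the helper name `interpreter_bits` is made up for illustration):

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running interpreter in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print("Python %d-bit, sys.maxsize = %d" % (bits, sys.maxsize))
```

If this reports 32, switching to a 64-bit interpreter or to a streaming parser are the two usual ways out.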

2 Answers

import xml.etree.ElementTree as ET
tree = ET.ElementTree(file="xyz.xml")

for elem in tree.iter():
    print elem.attrib

Try this code to read your file. It may help.
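
Note that `ET.ElementTree(file=...)` still builds the whole tree in memory before `tree.iter()` runs, so on a 912 MB file it may hit the same MemoryError. A minimal sketch of a streaming variant with the standard library's `iterparse` follows. The helper name `iter_attribs` and the sample XML are made up for illustration, and the code is written so that it also runs on Python 3:

```python
import io
import xml.etree.ElementTree as ET


def iter_attribs(source):
    """Yield (tag, attributes) per element without keeping the parsed subtrees."""
    # Only "end" events: an element is complete once its end tag has been read.
    for event, elem in ET.iterparse(source, events=("end",)):
        yield elem.tag, dict(elem.attrib)  # copy the attributes before clearing
        # Drop the finished element's text, attributes and children.
        # The emptied element itself stays attached to its parent.
        elem.clear()


# Tiny in-memory stand-in for the real file (hypothetical content).
sample = io.BytesIO(b'<root a="1"><item b="2"/><item b="3"/></root>')
result = list(iter_attribs(sample))
for tag, attrib in result:
    print(tag, attrib)
```

Elements are reported in the order their end tags are read, so children come before their parent.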

Anuj Bhasin
  • @sidddhu See http://stackoverflow.com/questions/324214/what-is-the-fastest-way-to-parse-large-xml-docs-in-python for an application of the iterparse function of the cElementTree module. – Anuj Bhasin Mar 15 '16 at 08:05

Here is what I tried, using lxml:

from lxml import etree
xmL = 'F:\\Reports\\Logs\\Result_TG1_V16.xml'

context = etree.iterparse(xmL, events=("start", "end"))
for event, element in context:
    if element.tag == 'TasksReportNode':
        for child1 in element:
            for child2 in child1:
                if child2.get("RowCount") == "0":
                    for child3 in child2:
                        print(child3.tag, child3.text)
    element.clear()  # discard the element
del context

I am able to parse all tags and retrieve the required data.
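
For reference, a commonly used variant of this pattern acts only on "end" events, so that a node's subtree is complete when it is inspected, and clears the root after each processed node. The sketch below uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages. Only `TasksReportNode` and `RowCount` come from the code above; the function name `zero_row_entries` and the sample element names (`Report`, `Group`, `Table`, `Name`) are hypothetical:

```python
import io
import xml.etree.ElementTree as ET


def zero_row_entries(source, node_tag="TasksReportNode"):
    """Yield (tag, text) for entries two levels below node_tag with RowCount == "0"."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root element
    for event, elem in context:
        # Act only once the end tag has been read, so the subtree is complete.
        if event == "end" and elem.tag == node_tag:
            for child1 in elem:
                for child2 in child1:
                    if child2.get("RowCount") == "0":
                        for child3 in child2:
                            yield child3.tag, child3.text
            root.clear()  # release everything parsed so far


# Tiny in-memory stand-in for the real file (hypothetical structure).
sample = io.BytesIO(
    b"<Report>"
    b"<TasksReportNode><Group>"
    b'<Table RowCount="0"><Name>empty</Name></Table>'
    b'<Table RowCount="5"><Name>full</Name></Table>'
    b"</Group></TasksReportNode>"
    b"</Report>"
)
entries = list(zero_row_entries(sample))
print(entries)
```

Passing a file path instead of the `BytesIO` object should work the same way, since `iterparse` accepts either.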

siddhu619