I have a memory problem parsing a large XML file.
The file looks like this (just the first few rows):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE raml SYSTEM 'raml20.dtd'>
<raml version="2.0" xmlns="raml20.xsd">
<cmData type="actual">
<header>
<log dateTime="2019-02-05T19:00:18" action="created" appInfo="ActualExporter">InternalValues are used</log>
</header>
<managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M-1" id="366">
<p name="linkedMrsiteDN">PL/TE-2</p>
<p name="name">Name of street</p>
<list name="PiOptions">
<p>0</p>
<p>5</p>
<p>2</p>
<p>6</p>
<p>7</p>
<p>3</p>
<p>9</p>
<p>10</p>
</list>
<p name="btsName">4251</p>
<p name="spareInUse">1</p>
</managedObject>
<managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M10" id="958078">
<p name="linkedMrsiteDN">PLMN-PLMN/MRSITE-138</p>
<p name="name">Street 2</p>
<p name="btsName">748</p>
<p name="spareInUse">3</p>
</managedObject>
<managedObject class="MRBTS" version="MRBTS17A_1701_003" distName="PL/M21" id="1482118">
<p name="name">Stree 3</p>
<p name="btsName">529</p>
<p name="spareInUse">4</p>
</managedObject>
</cmData>
</raml>
I am using the xml.etree.ElementTree parser, but with a file over 4 GB, even on a machine with 32 GB of RAM, I run out of memory. The code I'm using:
def parse_xml(data, string_in, string_out):
    """
    :param data: root element of the parsed XML file to be processed
    :param string_in: string that should exist in the distinguished name
    :param string_out: string that should not exist in the distinguished name
        (string_in and string_out select the parsing level: site or cell)
    :return: dictionary with all unnecessary objects for the selected technology
    """
    version_dict = {}
    for child in data:
        for grandchild in child:
            dist_name = grandchild.get('distName')
            if isinstance(dist_name, str) and string_in in dist_name and string_out not in dist_name:
                inner_dict = {'class': grandchild.get('class'),
                              'version': grandchild.get('version')}
                for grandgrandchild in grandchild:
                    if grandgrandchild.tag == '{raml20.xsd}p':
                        inner_dict[grandgrandchild.get('name')] = grandgrandchild.text
                    elif grandgrandchild.tag == '{raml20.xsd}list':
                        p_lista = []
                        for gggchild in grandgrandchild:
                            if gggchild.tag == '{raml20.xsd}p':
                                p_lista.append(gggchild.text)
                            elif gggchild.tag == '{raml20.xsd}item':
                                for gdchild in gggchild:
                                    inner_dict[gdchild.get('name')] = gdchild.text
                        inner_dict[grandgrandchild.get('name')] = p_lista
                version_dict[dist_name] = inner_dict
    return version_dict
I have tried iterparse with root.clear(), but nothing really helps. I have heard that DOM parsers are the slower ones, but SAX gives me an error:
ValueError: unknown url type: '/development/data/raml20.dtd'
Not sure why. If anyone has a suggestion on how to improve the approach or its performance, I would be really thankful. If bigger XML samples are needed, I am willing to provide them.
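If it helps, my understanding is that the SAX error comes from the parser trying to download the DTD referenced in the DOCTYPE line. Below is a minimal sketch of what seems to avoid it (assuming `feature_external_ges` is the right switch to turn off external entity fetching; the handler just counts `managedObject` elements as a smoke test):

```python
import xml.sax
from xml.sax.handler import feature_external_ges

class MOCounter(xml.sax.ContentHandler):
    """Counts managedObject elements as a minimal streaming smoke test."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == 'managedObject':
            self.count += 1

def count_managed_objects(path):
    parser = xml.sax.make_parser()
    # do not try to fetch external entities such as raml20.dtd
    parser.setFeature(feature_external_ges, False)
    handler = MOCounter()
    parser.setContentHandler(handler)
    parser.parse(path)
    return handler.count
```

With the feature disabled, the DOCTYPE line is left alone and the parse streams through the file without building a tree.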
Thanks in advance.
EDIT:
Code I tried after the first answer:
import xml.etree.ElementTree as ET

def parse_item(d):
    # wrap the extracted chunk so it has a single root element
    a = '<root>' + d + '</root>'
    tree = ET.fromstring(a)
    outer_dict_yield = {}
    for elem in tree:
        inner_dict_yield = {}
        for el in elem:
            if isinstance(el.get('name'), str):
                inner_dict_yield[el.get('name')] = el.text
        inner_dict_yield['version'] = elem.get('version')
        outer_dict_yield[elem.get('distName')] = inner_dict_yield
    return outer_dict_yield
def read_a_line(file_object):
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data

min_data = ""
inside = False
outer_main = {}
counter = 0
with open('/development/file.xml') as f:
    for line in read_a_line(f):
        if line.find('<managedObject') != -1:
            inside = True
        if inside:
            min_data += line
        if line.find('</managedObject') != -1:
            inside = False
            counter = counter + 1
            outer_main[counter] = parse_item(min_data)
            min_data = ''
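For comparison, the iterparse pattern I was experimenting with before looked roughly like this (a minimal sketch; the `{raml20.xsd}` namespace prefix comes from the sample above, and handling of `<list>` elements is omitted):

```python
import xml.etree.ElementTree as ET

NS = '{raml20.xsd}'

def stream_managed_objects(path):
    """Stream managedObject elements, freeing each one after processing."""
    result = {}
    context = ET.iterparse(path, events=('start', 'end'))
    _, root = next(context)          # grab the root element so we can clear it
    for event, elem in context:
        if event == 'end' and elem.tag == NS + 'managedObject':
            inner = {'class': elem.get('class'),
                     'version': elem.get('version')}
            for p in elem.findall(NS + 'p'):   # direct <p> children only
                inner[p.get('name')] = p.text
            result[elem.get('distName')] = inner
            elem.clear()    # drop this element's children
            root.clear()    # drop already-processed siblings held by the root
    return result
```

The point of clearing both the finished element and the root after each `managedObject` is that the partially built tree never holds more than the current record, so memory stays flat regardless of file size.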