I have a large database stored in a single XML file, and I need to process its data with Python.
I tried to parse it with the xml library, using xml.dom.minidom and (in another script) xml.etree.ElementTree, descending tag by tag until I reach the <s> tags and then iterating over the <t> tags I need to retrieve the relevant data.
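Roughly, the ElementTree version does the following. This is a simplified sketch rather than the exact script: the function name collect_terminals is a placeholder, and the path through the tags follows the file structure shown further down.

```python
import xml.etree.ElementTree as ET


def collect_terminals(source):
    """Parse the whole document at once and return the attributes of every <t>.

    `source` is a file name or a binary file object (placeholder argument).
    """
    tree = ET.parse(source)  # builds the complete tree in memory
    root = tree.getroot()    # the <corpus> element
    rows = []
    # Walk down tag by tag: corpus -> body -> s -> graph -> terminals -> t
    for s in root.find("body").findall("s"):
        for t in s.find("graph").find("terminals").findall("t"):
            rows.append(dict(t.attrib))
    return rows
```

Because ET.parse builds the entire tree before the loop starts, memory use grows with the size of the file.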
The problem is that the file is very large (217 MB) and I cannot load or parse it: I keep getting a memory error, and the file never even finishes loading.
The structure of the file is this:
<corpus>
<head>
...
</head>
<body>
<s id="s1">
<graph>
<terminals>
<t id="s1_1" ex="bla" ex2="bla2"/>
<t id="s1_2" ex="bla" ex2="bla2"/>
<t id="s1_3" ex="bla" ex2="bla2"/>
</terminals>
</graph>
</s>
<s id="s2">
<graph>
<terminals>
<t id="s2_1" ex="bla" ex2="bla2"/>
<t id="s2_2" ex="bla" ex2="bla2"/>
<t id="s2_3" ex="bla" ex2="bla2"/>
</terminals>
</graph>
</s>
.... # more than 50K <s> tags and almost 1M <t> tags
</body>
</corpus>
What I really need is to retrieve all the <t/> tags and store their attribute data in a CSV file or something similar, but my computer cannot parse a file this large.
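One direction that might avoid loading everything at once is incremental parsing with xml.etree.ElementTree.iterparse, writing each <t> to the CSV as soon as it is read. The sketch below is untested on the full file; the function name terminals_to_csv and the column list (id, ex, ex2, taken from the sample above) are assumptions.

```python
import csv
import xml.etree.ElementTree as ET


def terminals_to_csv(source, out, fieldnames=("id", "ex", "ex2")):
    """Stream <t> elements from `source` and write their attributes to `out`.

    `source` is a file name or binary file object; `out` is a text file
    object opened for writing. Returns the number of <t> rows written.
    """
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            extrasaction="ignore", restval="")
    writer.writeheader()
    count = 0
    # "end" events fire once an element and all of its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "t":
            writer.writerow(elem.attrib)
            count += 1
        elif elem.tag == "s":
            # The sentence is done; drop its subtree so memory stays bounded.
            # (The emptied <s> element itself stays attached to <body>.)
            elem.clear()
    return count
```

It could be called like this, with corpus.xml and terminals.csv as placeholder file names:

```python
with open("terminals.csv", "w", newline="", encoding="utf-8") as out:
    terminals_to_csv("corpus.xml", out)
```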
I would be very happy to read your advice.
Thank you very much!