
I have a large database stored in a single XML file, and I need to process its data with Python.

I tried parsing it with the standard `xml` library, using `xml.dom.minidom` and (in another script) `xml.etree.ElementTree`, descending tag by tag until I reach the `<s>` elements and then iterating over the `<t>` tags I need to retrieve the relevant data.

My problem is that the file is really large (217 MB) and I cannot parse or load it. I keep getting a memory error, and the file never finishes loading.

The structure of the file is this:

<corpus>

<head>
...
</head>

<body>
  <s id="s1">
    <graph>
      <terminals>
        <t id="s1_1" ex="bla" ex2="bla2"/>
        <t id="s1_2" ex="bla" ex2="bla2"/>
        <t id="s1_3" ex="bla" ex2="bla2"/>
      </terminals>
    </graph>
  </s>

  <s id="s2">
    <graph>
      <terminals>
        <t id="s2_1" ex="bla" ex2="bla2"/>
        <t id="s2_2" ex="bla" ex2="bla2"/>
        <t id="s2_3" ex="bla" ex2="bla2"/>
      </terminals>
    </graph>
  </s>

.... # more than 50K <s> tags and almost 1M <t> tags

</body>

</corpus>

What I really need is to retrieve all the `<t/>` tags and store their attribute values in a CSV file or something similar, but my computer cannot parse a file this large.

I would be very happy to read your advice.

Thank you very much!

  • How are you parsing it? – zvone Jan 07 '21 at 21:50
  • See [using a sax parser](https://stackoverflow.com/questions/12263029/how-to-get-results-from-xml-sax-parser-in-python) or alternatively [pulldom](https://www.ibm.com/developerworks/xml/library/x-tipulldom/x-tipulldom-pdf.pdf). –  Jan 07 '21 at 21:54
  • @zvone I parsed it with the `xml` library. I have now added that to the post (2nd paragraph). Thanks for asking! @Justin Ezequiel I will try it. Thanks! – user10369333 Jan 07 '21 at 23:37
  • Please show your code. Look also into `iterparse`. Quite a few online blogs and SO posts. Give it a try and come back with any issues. – Parfait Jan 08 '21 at 00:58
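
The `iterparse` suggestion from the comments could look roughly like the sketch below. It uses only the standard library and assumes the structure shown in the question: `<t>` elements with `id`, `ex`, and `ex2` attributes nested inside `<s>` elements. The function name `terminals_to_csv` and the `fields` parameter are made up for illustration. Each `<s>` element is cleared once it has been fully read, so the complete tree is never held in memory at the same time.

```python
import csv
import xml.etree.ElementTree as ET


def terminals_to_csv(source, out, fields=("id", "ex", "ex2")):
    """Stream <t> attributes from `source` (path or binary file object)
    into CSV rows written to the text file object `out`.

    Returns the number of <t> elements written.
    """
    writer = csv.writer(out)
    writer.writerow(fields)  # header row
    count = 0
    # "end" events fire when an element's closing tag has been parsed.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "t":
            # Missing attributes are written as empty strings.
            writer.writerow([elem.get(name, "") for name in fields])
            count += 1
        elif elem.tag == "s":
            # The whole sentence has been processed: drop its children
            # so memory use stays roughly constant.
            elem.clear()
    return count
```

To use it, open the output with `open("terminals.csv", "w", newline="")` and pass the path of the XML file as `source`; both file names here are placeholders. The emptied `<s>` elements still stay attached to `<body>`, but they are small compared with the full tree.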

1 Answer


Try this XML library, which can read the file line by line instead of loading it all at once. Install it with: pip install simplified_scrapy

from simplified_scrapy import SimplifiedDoc, utils

doc = SimplifiedDoc()
doc.loadFile('test.xml', lineByline=True) # Read data line by line

for s in doc.getIterable('s'):
    print(s.selects('t'))
dabingsou