Parsing a huge XML file with invalid characters

Question

I have a 1 GB XML file that contains some invalid characters, such as an unescaped '&'. I want to parse it in Python, so I used ElementTree as shown below:

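# cElementTree was deprecated in Python 3.3 and removed in 3.9;
# the plain ElementTree module is the equivalent.
import xml.etree.ElementTree as ET

def main():
    # Stream the file instead of loading all of it into memory.
    context = iter(ET.iterparse('newscor.xml', events=("start", "end")))
    # The first event is the start of the root element.
    event, root = next(context)

    for event, elem in context:
        # elem.text is only guaranteed to be complete at the "end" event.
        if event == "end" and elem.tag == 'group':
            print(elem.text)
            # Drop elements that were already processed to limit memory use.
            root.clear()

main()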

But the loop line (for event, elem in context) raised the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token)

To handle this error, I tried to use lxml with recover=True for the parser, as described in this link. However, iterparse() in lxml does not have a parser parameter.

Therefore, I also tried to use SAX as in this solution, but I don't know where to call the escape method.

What can I use to avoid invalid characters and parse this large file?

Try using lxml with the HTML parser instead of the standard XML parser. The HTML parser is more forgiving with invalid input. Alternatively, you can try to use [HTML tidy](http://www.html-tidy.org/) in XML mode to repair the file. There is even a Python package for it. - Tomalak, Nov 22 '17 at 14:39
Or you can use Perl's or Python's regex package to pre-process your XML file and get rid of the & sign. - vtd-xml-author, Nov 23 '17 at 00:21
I solved this problem with the tidy tool (thanks, Tomalak, for your comment). Tidy converts the special character & to &amp;. - Arife Kübra, Nov 27 '17 at 18:20
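
For anyone who prefers the regex pre-processing idea from the comments over tidy, here is a rough sketch of how it might look with only the standard library. It escapes bare ampersands line by line and feeds the repaired bytes to ElementTree's XMLPullParser, so the file is never held in memory as a whole. The names iter_group_texts and BARE_AMP are made up for this example, and the sketch assumes an ASCII-compatible encoding such as UTF-8, that no entity reference is split across a line break, and that the file has no CDATA sections (a '&' inside CDATA would be escaped by mistake).

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical helper pattern: a "&" that does NOT start a reference
# such as &amp;, &#38; or &#x26;.
BARE_AMP = re.compile(
    rb"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)

def iter_group_texts(stream, tag="group"):
    """Yield the text of every <tag> element, escaping bare '&' on the fly.

    `stream` must be opened in binary mode (assumption: UTF-8 or another
    ASCII-compatible encoding).
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None
    for line in stream:
        # Repair this chunk before the XML parser sees it.
        parser.feed(BARE_AMP.sub(b"&amp;", line))
        for event, elem in parser.read_events():
            if root is None:
                root = elem  # the first "start" event is the root element
            elif event == "end" and elem.tag == tag:
                yield elem.text
                root.clear()  # release elements that were already handled
    parser.close()

if __name__ == "__main__":
    sample = io.BytesIO(
        b"<root><group>Tom & Jerry</group><group>AT&T</group></root>"
    )
    for text in iter_group_texts(sample):
        print(text)
```

For the real file, the same function can be called as iter_group_texts(open('newscor.xml', 'rb')). This only repairs stray ampersands; other kinds of invalid input (for example control characters or an unescaped '<') would still need tidy or a recovering parser.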
