I have a 1 GB XML file that contains some invalid characters, such as an unescaped '&'. I want to parse it in Python. To do this, I used ElementTree as below:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

def main():
    context = ET.iterparse('newscor.xml', events=("start", "end"))
    # The first event is the "start" of the root element
    event, root = next(context)

    for event, elem in context:
        # elem.text is only guaranteed to be complete on the "end" event
        if event == "end" and elem.tag == 'group':
            print(elem.text)
            # Drop already-processed elements to keep memory usage low
            root.clear()

main()

But the line `for event, elem in context:` raised the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token)
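For reference, the same error can be reproduced without the 1 GB file. The snippet below is only a minimal sketch: `BAD_XML` and `texts_until_error` are made-up names for illustration, and the sample document is hypothetical. It shows that the strict parser stops at the first bare '&' rather than skipping it.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the second <group> contains an unescaped '&'
BAD_XML = b"<root><group>A</group><group>B & C</group></root>"

def texts_until_error(data):
    """Collect <group> texts until the parser hits malformed input."""
    seen = []
    try:
        for event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
            if elem.tag == "group":
                seen.append(elem.text)
    except ET.ParseError as exc:
        # The strict parser cannot resume after the first error
        return seen, exc
    return seen, None

seen, error = texts_until_error(BAD_XML)
print(seen, error)
```

So wrapping the loop in try/except does not help on its own, because everything after the first bad character is lost.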

To handle this error, I tried to use lxml with a parser created with recover=True, as described in this link. However, lxml's iterparse() does not accept a parser argument.

Therefore, I also tried to use SAX, as in this solution, but I don't know where to call the escape method.
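As far as I can tell, the escape function in question is `xml.sax.saxutils.escape`. It works on raw text before that text is put into an XML document. It is not a hook into the parser. A small sketch of what it does (the sample string is made up):

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical raw text that is not yet part of any XML document
raw = "Tom & Jerry <live>"

escaped = escape(raw)        # replaces &, < and > with entities
print(escaped)               # Tom &amp; Jerry &lt;live&gt;
print(unescape(escaped) == raw)
```

Because it also escapes '<' and '>', running it over the whole file would escape the tags too, so it cannot be applied directly to an existing XML document.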

What can I use to avoid invalid characters and parse this large file?

Adam Jaamour
  • Try to use lxml with the HTML parser instead of the standard XML parser. The HTML parser is more forgiving with invalid input. Alternatively you can try to use [HTML tidy](http://www.html-tidy.org/) in XML mode to repair the file. There even is a Python package for it. – Tomalak Nov 22 '17 at 14:39
  • or you can use perl/python's regex package to pre-process your xml file to get rid of the & sign. – vtd-xml-author Nov 23 '17 at 00:21
  • I solved this problem with the tidy tool (thanks, Tomalak, for your comment). Tidy converts the special character & to &amp;. – Arife Kübra Nov 27 '17 at 18:20
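The regex pre-processing idea from the comments could look roughly like the sketch below. It is not a tested solution for the actual file. `BARE_AMP` and `escape_bare_ampersands` are hypothetical names, and it assumes an ASCII-compatible encoding such as UTF-8, that no entity reference is split across lines, and that the file has no CDATA sections (a '&' inside CDATA would be rewritten too). An in-memory sample stands in for the large file.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches '&' that does not start an entity or character reference
BARE_AMP = re.compile(rb"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(src, dst):
    """Copy src to dst line by line, rewriting bare '&' as '&amp;'."""
    for line in src:
        dst.write(BARE_AMP.sub(b"&amp;", line))

# Hypothetical in-memory sample standing in for the large file
src = io.BytesIO(b"<root>\n<group>B & C &lt; D</group>\n</root>\n")
dst = io.BytesIO()
escape_bare_ampersands(src, dst)
dst.seek(0)

# The repaired copy now parses with the strict parser
texts = [elem.text for _, elem in ET.iterparse(dst, events=("end",))
         if elem.tag == "group"]
print(texts)
```

For the real file, `src` and `dst` would be files opened in binary mode ('rb' and 'wb'), so only one line is held in memory at a time. The repaired copy can then be passed to iterparse() as in the question.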
