I have a 1 GB XML file that contains some invalid characters such as '&'. I want to parse it in Python. To do this, I used ElementTree as below:
import xml.etree.cElementTree as cElementTree

def main():
    context = cElementTree.iterparse('newscor.xml', events=("start", "end"))
    context = iter(context)
    # Take the root element from the first event so it can be cleared later.
    event, root = next(context)
    for event, elem in context:
        if event == "start":
            if elem.tag == 'group':
                elem.tail = None
                print(elem.text)
        if elem.tag in ['group']:
            # Drop already-processed elements to keep memory usage low.
            root.clear()

main()
But it raised the following error on the line for event, elem in context:
xml.etree.ElementTree.ParseError: not well-formed (invalid token)
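The error can be reproduced without the large file. This is a minimal sketch using a made-up in-memory sample (the SAMPLE string and the try_parse helper are hypothetical, not taken from newscor.xml): a bare '&' in element text makes the parser fail, while the escaped form '&amp;' parses cleanly.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: a bare '&' inside element text is not well-formed XML.
SAMPLE = b"<root><group>AT&T news</group></root>"
FIXED = b"<root><group>AT&amp;T news</group></root>"


def try_parse(data):
    """Return the ParseError message, or None if the data parses cleanly."""
    try:
        for _event, _elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
            pass
    except ET.ParseError as exc:
        return str(exc)
    return None


print(try_parse(SAMPLE))  # an error message
print(try_parse(FIXED))   # None
```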
To handle this error, I tried to use lxml with recover=True for the parser, as described in this link. However, iterparse() in lxml does not have a parser parameter.
Therefore, I also tried the SAX approach from this solution, but I don't know where to apply the escape method.
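For context, xml.sax.saxutils.escape() works on plain text and replaces '&', '<' and '>' with entity references. The short sketch below, with made-up sample strings, shows why it cannot simply be applied to the whole file: it escapes the tags as well as the stray ampersand.

```python
from xml.sax.saxutils import escape

# Applied to text content only, the stray ampersand is fixed.
text_only = escape("AT&T news")
print(text_only)  # AT&amp;T news

# Applied to a whole fragment, the markup itself is escaped too.
whole_fragment = escape("<group>AT&T</group>")
print(whole_fragment)  # &lt;group&gt;AT&amp;T&lt;/group&gt;
```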
What can I use to avoid invalid characters and parse this large file?
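One possible direction, sketched under assumptions rather than offered as a confirmed solution, is to wrap the file in an object that escapes bare ampersands before ElementTree sees them. The AmpersandFixer class, the BARE_AMP pattern and the group_texts helper are hypothetical names. The sketch assumes that the file has regular line breaks, that '&' is the only invalid character, and that no CDATA section or comment contains a bare '&' that should stay as it is.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches '&' that does not start an entity or character reference.
BARE_AMP = re.compile(rb"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


class AmpersandFixer:
    """Hypothetical file-like wrapper that escapes bare '&' on the fly."""

    def __init__(self, raw):
        self._raw = raw  # binary file object

    def read(self, size=-1):
        # Read whole lines so a reference such as '&amp;' is never split
        # across two chunks; this may return somewhat more than `size` bytes.
        lines = self._raw.readlines(size)
        return BARE_AMP.sub(b"&amp;", b"".join(lines))


def group_texts(source):
    """Yield the text of each <group> element from a binary file object."""
    context = ET.iterparse(AmpersandFixer(source), events=("start", "end"))
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "group":
            yield elem.text
            root.clear()  # free elements that were already processed


# Made-up sample standing in for the real file.
sample = io.BytesIO(b"<root>\n<group>AT&T & co &amp; x</group>\n</root>\n")
print(list(group_texts(sample)))  # ['AT&T & co & x']
```

For the real file, the same helper could be fed a file opened in binary mode, for example open('newscor.xml', 'rb'). Reading on the "end" event is a deliberate choice here, since the text of an element is only guaranteed to be complete once its end tag has been seen.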