Parsing large XML file which contains invalid characters

Question

I have a large XML file (more than 1 GB). I want to extract the text from it and write it to a text file. The problem is that the file contains some invalid characters such as '&' (I do not need to keep them; they can be removed entirely). First, I used ElementTree to parse the file as below:

import xml.etree.ElementTree as ET
import re

newFile = open('using_element_tree.xml', 'w', encoding="utf8")

file = "newscor.xml"
context = ET.iterparse(file, events=("start", "end"))
context = iter(context)


for event, elem in context:
    tag = elem.tag
    tag = re.sub('{http://www.xml-ces.org/schema}', '', tag)

    if event == 'start' and (tag == 's' or tag == 'q'):
        value = elem.text
        if value:
            value = value.strip("&, <, >")
            value = value.strip()
            newFile.write(value)
            print(value)

    elem.clear()

However, this raised xml.etree.ElementTree.ParseError: not well-formed (invalid token) on the line for event, elem in context: because of an & character. I could not find a way to escape the invalid characters before entering the for loop. I also tried lxml with the recover=True parameter, but I could not find an iterparse() function that accepts this option.

Then, I used BeautifulSoup to parse my file as below:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

newFile = open('using_bs4.xml', 'w', encoding="utf-8")


def only_s_and_q_tags():
    return "s" or "q"

s_and_q_tags = SoupStrainer(only_s_and_q_tags())

with open("newscor.xml", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "xml", parse_only=s_and_q_tags)


for string in soup.strings:
    if string not in ['\n', '\r\n']:
        print(repr(string))
        newFile.write(string)

This ran without any error (it exited with code 0) and wrote to the text file, but the output contained only a tiny portion of my file. Since there is no error, I do not know what to fix.
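One detail in the snippet above may be worth checking, although it is not necessarily the cause of the truncated output: in Python, the expression "s" or "q" evaluates to "s", so the strainer is only ever asked for s tags. The small sketch below shows this with the standard library only. The name wanted_tags is hypothetical, and the comment about SoupStrainer assumes it accepts a list of tag names in the same way find_all() does.

```python
def only_s_and_q_tags():
    # `or` returns its first truthy operand, so this is always "s".
    return "s" or "q"


print(only_s_and_q_tags())

# A list keeps both names; it could be passed on as
# SoupStrainer(wanted_tags) (assumption: bs4 is installed).
wanted_tags = ["s", "q"]
print(wanted_tags)
```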

What should I do to avoid the invalid characters and parse my file? Please point me in the right direction.

So far I tried to solve my problem using only Python, but could not manage it. In the end I solved it by preprocessing my XML file with HTML Tidy (http://www.html-tidy.org/) and then parsing the result with xml.etree.ElementTree. Thanks! - Kubra, Nov 25 '17 at 13:41
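For a Python-only take on the preprocessing step, here is a minimal sketch rather than a definitive solution. It assumes that the only malformed tokens are bare & characters, that the file has no CDATA sections or DTD-defined entities (their ampersands would be escaped by mistake), that it can be read as UTF-8 text, and that s and q elements are not nested inside each other. The names BARE_AMP, NAMESPACE and iter_sentence_text are hypothetical, and the sample document is made up. Each line is repaired and fed to xml.etree.ElementTree.XMLPullParser, so the whole file is never held in memory as one string.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches '&' that does not start a predefined or numeric entity reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")
NAMESPACE = "{http://www.xml-ces.org/schema}"


def iter_sentence_text(lines, tags=("s", "q")):
    """Yield the stripped text of each wanted element, one line of input at a time."""
    parser = ET.XMLPullParser(events=("end",))

    def drain():
        for _event, elem in parser.read_events():
            tag = elem.tag
            if tag.startswith(NAMESPACE):
                tag = tag[len(NAMESPACE):]
            if tag in tags:
                text = "".join(elem.itertext()).strip()
                if text:
                    yield text
                # Free the content of elements that were already written out.
                elem.clear()

    for line in lines:
        # Escape bare ampersands; use "" instead of "&amp;" to drop them.
        parser.feed(BARE_AMP.sub("&amp;", line))
        yield from drain()
    parser.close()
    yield from drain()


# Made-up sample standing in for the real file.
sample = io.StringIO(
    '<doc xmlns="http://www.xml-ces.org/schema">\n'
    "<s>Tom & Jerry</s>\n"
    "<q>a &lt; b &amp; c</q>\n"
    "</doc>\n"
)
for text in iter_sentence_text(sample):
    print(text)
```

For the real file, the sample could be swapped for open("newscor.xml", encoding="utf-8") and each yielded string written to the output file. Working line by line is safe for this repair because an entity reference never spans a line break.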

