I have a large XML file (more than 1 GB). I want to extract the text from it and write it to a text file. The problem is that the file contains some invalid characters such as '&'. I don't need to keep them in the output, so they can be removed entirely. First, I used ElementTree to parse the file as below:
    import xml.etree.ElementTree as ET
    import re

    newFile = open('using_element_tree.xml', 'w', encoding="utf8")
    file = "newscor.xml"
    context = ET.iterparse(file, events=("start", "end"))
    context = iter(context)
    for event, elem in context:
        tag = elem.tag
        tag = re.sub('{http://www.xml-ces.org/schema}', '', tag)
        if event == 'start' and (tag == 's' or tag == 'q'):
            value = elem.text
            if value:
                value = value.strip("&, <, >")
                value = value.strip()
                newFile.write(value)
                print(value)
        elem.clear()
However, it raised xml.etree.ElementTree.ParseError: not well-formed (invalid token) on this line: for event, elem in context:
This is caused by an & character. I could not find a way to escape invalid characters before entering the for loop. I also tried lxml, whose parser accepts recover=True, but I could not find a way to pass this option to iterparse().
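To make "escaping before entering the for loop" concrete, here is a minimal sketch of one possible approach, using only the standard library. It assumes the only problem is a bare & that does not start a predefined or numeric entity reference, that the file is split across many lines so it can be read line by line, and that no CDATA sections or DTD-defined entities are involved. The names escape_bare_ampersands and iter_text are hypothetical, and the inline sample stands in for the real file.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does not start a predefined or numeric entity reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")
NS = "{http://www.xml-ces.org/schema}"  # namespace taken from the code above


def escape_bare_ampersands(line):
    """Hypothetical helper: turn a stray '&' into '&amp;'."""
    return BARE_AMP.sub("&amp;", line)


def iter_text(lines, tags=("s", "q")):
    """Feed sanitized lines to a pull parser and yield the text of matching elements."""
    parser = ET.XMLPullParser(events=("end",))
    for line in lines:
        parser.feed(escape_bare_ampersands(line))
        for _event, elem in parser.read_events():
            if elem.tag.replace(NS, "") in tags:
                text = (elem.text or "").strip()
                if text:
                    yield text
            elem.clear()  # drop the contents of elements already processed
    parser.close()


# Small in-memory sample standing in for the real file (an assumption).
sample = io.StringIO(
    '<doc xmlns="http://www.xml-ces.org/schema">\n'
    "<s>Tom & Jerry</s>\n"
    "<q>a &lt; b</q>\n"
    "</doc>\n"
)
results = list(iter_text(sample))
print(results)
```

With the real file, the object returned by open("newscor.xml", encoding="utf-8") could be passed to iter_text in place of sample, writing each yielded string to the output file. Replacing "&amp;" with "" in the helper would drop the stray characters instead of escaping them.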
Then, I used BeautifulSoup to parse my file as below:
    from bs4 import BeautifulSoup
    from bs4 import SoupStrainer

    newFile = open('using_bs4.xml', 'w', encoding="utf-8")

    def only_s_and_q_tags():
        return "s" or "q"

    s_and_q_tags = SoupStrainer(only_s_and_q_tags())
    with open("newscor.xml", encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, "xml", parse_only=s_and_q_tags)
        for string in soup.strings:
            if string not in ['\n', '\r\n']:
                print(repr(string))
                newFile.write(string)
This did not raise any error (it exited with code 0) and it did write to the text file, but the output contains only a tiny portion of my file. Since there is no error message, I don't know what to try next.
How can I get past the invalid characters and parse the whole file? Please point me in the right direction.