I have to parse a large amount of XML files and write it to a text file. However, some of the XML files contain special/illegal characters. I have tried using different methods such as Escaping strings for use in XML but I could not get it to work. I know 1 way to solve this is by editing the XML file itself, but there are thousands of files.
import os
from xml.etree import ElementTree
from xml.sax.saxutils import escape
fileNum = 0;
saveFile = open('NewYork_1.txt','w')
for path, dirs, files in os.walk("NewYork_1"):
for f in files:
fileName = os.path.join(path, f)
with open(fileName,'r', encoding='utf-8') as myFile:
# print(myFile.read())
if "&" in myFile:
myFile = myFile.replace("&", "&") #This does not work.
Also, some of the XML files also have unicode in them. There are emojis.
<Thread>
<ThreadID></ThreadID>
<Title></Title>
<InitPost>
<UserID></UserID>
<Date></Date>
<icontent></icontent>
</InitPost>
<Post>
<UserID></UserID>
<Date></Date>
<rcontent></rcontent>
</Post>
</Thread>