I have been wrestling, without much success, with a 2 GB XML file on Windows 10 64-bit. I am using a splitting script found on GitHub here and managed to get it running, but I keep getting a UnicodeEncodeError on a particular character, \u0126, which is Ħ (a letter used in the Maltese alphabet). The script executes, but after the first chunk is saved and the second one started, the error comes up.
Edit: The XML file is a Disqus dump from a local portal.
I have followed the advice found in this SO question and ran chcp 65001 and setx PYTHONIOENCODING utf-8 in the Windows command prompt, and the echo command confirms that both took effect.
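For what it's worth, this is how I sanity-check the environment from inside Python afterwards (just my own check, not part of the splitter script):

import os, sys

print(os.environ.get('PYTHONIOENCODING'))  # expecting 'utf-8' after the setx above
print(sys.stdout.encoding)                 # reports the stdio encoding; as far as I can tell this does not change what open() uses by default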
I have tried many of the solutions found under "Questions that may already have your answer", but I still get the UnicodeEncodeError on the same letter. I have also tried a crude data.replace('Ħ', 'H') and a data.replace('\\u1026', 'H'), but the error still comes up, and in the same position. Every new test takes around 5 minutes before the error appears, and I have been struggling with this nuisance for over a day.
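To make those attempts explicit, this is roughly what they look like inside the script (illustration only; the commented-out line is the escape I believe corresponds to Ħ, which I have not re-run yet):

data = data.replace('Ħ', 'H')         # literal character, exactly as typed
data = data.replace('\\u1026', 'H')   # a plain six-character string: backslash, u, 1, 0, 2, 6
# data = data.replace('\u0126', 'H')  # the actual escape for Ħ (U+0126)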
I tried reading the file in Notepad++ 64-bit, but the program ends up "Not responding" when I do a search, as my 16 GB of RAM get eaten up and the system becomes sluggish.
I have had to change the first line of the code block below to read:
cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt', encoding='utf-8')
and also the second line to read:
with open(filename, 'rt', encoding='utf-8') as xml_file:
but still no luck. I also tried errors='replace' and errors='ignore', but to no avail.
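Concretely, the variants of the two open() calls I experimented with looked something like this (only the errors= value changed between runs; everything else is from the script):

cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext),
                'wt', encoding='utf-8', errors='replace')   # also tried errors='ignore'
with open(filename, 'rt', encoding='utf-8', errors='replace') as xml_file:
    ...  # chunk-reading loop unchanged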
cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt')
with open(filename, 'rt') as xml_file:
    while True:
        # Read a chunk
        chunk = xml_file.read(CHUNK_SIZE)
        if len(chunk) < CHUNK_SIZE:
            # End of file
            # tell the parser we're done
            p.Parse(chunk, 1)
            # exit the loop
            break
        # process the chunk
        p.Parse(chunk)

# Don't forget to close our handle
cur_file.close()
Another line I had to edit from the original code is: cur_file.write(data.encode('utf-8'))
which I changed to:
cur_file.write(data) # .encode('utf-8')) #*
as otherwise execution stopped with TypeError: write() argument must be str, not bytes.
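As far as I understand, that TypeError simply comes from the file being opened in text mode, where write() in Python 3 only accepts str; a tiny standalone check (the file name is just mine, nothing from the script) behaves the same way:

# minimal check, separate from the splitter
with open('scratch.txt', 'wt', encoding='utf-8') as f:
    f.write('Ħ')                    # fine: text-mode files take str
    # f.write('Ħ'.encode('utf-8'))  # TypeError: write() argument must be str, not bytes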
def char_data(data):
    """ Called by the parser when he meet data """
    global cur_size, start
    wroteStart = False
    if start is not None:
        # The data belong to an element, we should write the start part first
        cur_file.write('<%s%s>' % (start[0], attrs_s(start[1])))
        start = None
        wroteStart = True
    # ``escape`` is too much for us, only & and < need to be escaped there ...
    data = data.replace('&', '&amp;')
    data = data.replace('<', '&lt;')
    if data == '>':
        data = '&gt;'
    cur_file.write(data.encode('utf-8')) #*
    cur_size += len(data)
    if not wroteStart:
        # The data was outside of an element, it could be the right moment to
        # make the split
        next_file()
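If it helps with suggestions, I could wrap that write to log exactly what the parser hands over and which encoding the output file actually ended up with (just a debugging sketch, not currently in the script):

# debug wrapper around the write inside char_data(), sketch only
try:
    cur_file.write(data)
except UnicodeEncodeError as exc:
    print('failed on %r (output file encoding: %s): %s' % (data, cur_file.encoding, exc))
    raise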
Any help would be greatly appreciated.
EDIT: added the traceback. The problem always occurs when trying to write to the file.
Traceback (most recent call last):
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 249, in <module>
    main(args[0], options.output_dir)
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 229, in main
    p.Parse(chunk)
  File "..\Modules\pyexpat.c", line 282, in CharacterData
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 180, in char_data
    cur_file.write(data) # .encode('utf-8'))
  File "C:\Users\myself\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in position 6: character maps to <undefined>
Edit: I have tried replacing the offending characters in Notepad++, but another one, '\u200e', cropped up, so replacing characters one by one is clearly not robust.
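For completeness, that exact error is easy to reproduce on my machine whenever a file is opened without an explicit encoding (which makes me suspect something is still being opened with the default cp1252, although I cannot see where):

# standalone reproduction on Windows, where the locale default is cp1252
with open('scratch.txt', 'wt') as f:   # no encoding= given
    f.write('\u200e')                  # UnicodeEncodeError: 'charmap' codec can't encode character '\u200e'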