
I have been wrestling, without much success, with a 2 GB XML file on Windows 10 64-bit. I am using some code found on GitHub here and managed to get it going, but I keep getting UnicodeErrors on a particular character, \u0126, which is Ħ (a letter used in the Maltese alphabet). The script executes, but after the first chunk is saved and the second one started, the error comes up.

Edit: The XML file is a Disqus dump from a local portal.

I have followed the advice found in this SO question and set chcp 65001 and setx PYTHONIOENCODING utf-8 in the Windows command prompt, and the echo command checks out.
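For what it's worth, from inside Python the setting shows up like this (just a sketch; as far as I understand, PYTHONIOENCODING only affects the standard streams, not files opened with open()):

import os
import sys

# PYTHONIOENCODING changes the encoding of stdin/stdout/stderr only;
# files opened with open() still fall back to the locale default.
print(os.environ.get('PYTHONIOENCODING'))  # utf-8
print(sys.stdout.encoding)                 # utf-8 when the variable is set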

I have tried many of the solutions found in the "Questions that may already have your answer" list, but I still get the UnicodeError on the same letter. I have also tried a crude data.replace('Ħ', 'H') and also data.replace('\\u1026', 'H'), but the error still comes up, and in the same position. Every test of something new takes around 5 minutes before the error appears, and I have been struggling with this nuisance for over a day.

I tried reading the file in Notepad++ 64-bit, but the program ends up Not Responding when I do a search, as my 16 GB of RAM get eaten up and the system becomes sluggish.

I have had to change the first line of the code snippet below to read:

cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt', encoding='utf-8')

and also the second line to read:

with open(filename, 'rt', encoding='utf-8') as xml_file:

but still no juice. I also used errors='replace' and errors='ignore', but to no avail. For reference, the relevant part of the original, unmodified code is:

cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt')

with open(filename, 'rt') as xml_file:
    while True:
        # Read a chunk
        chunk = xml_file.read(CHUNK_SIZE)
        if len(chunk) < CHUNK_SIZE:
            # End of file
            # tell the parser we're done
            p.Parse(chunk, 1)
            # exit the loop
            break
        # process the chunk
        p.Parse(chunk)

# Don't forget to close our handle
cur_file.close()
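For completeness, this is roughly how the two open() calls looked with the explicit encoding combined with an errors policy (just a sketch using the same variable names as above; I tried both 'replace' and 'ignore'):

# Sketch: explicit UTF-8 plus an errors policy on both handles.
# out_dir, root, FMT, cur_idx, ext and filename come from the original script.
cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext),
                'wt', encoding='utf-8', errors='replace')

with open(filename, 'rt', encoding='utf-8', errors='replace') as xml_file:
    ...  # same chunked parsing loop as above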

Another line I had to edit from the original code is cur_file.write(data.encode('utf-8')), which I had to change to:

cur_file.write(data)  # .encode('utf-8')) #*

as otherwise the execution stopped with TypeError: write() argument must be str, not bytes.
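That error makes sense: a file opened in text mode ('wt') expects str and does the encoding itself, while only a file opened in binary mode ('wb') accepts bytes. A minimal standalone illustration (not part of the split script):

# Text mode: pass str; the file object encodes it with its own encoding.
with open('demo.txt', 'wt', encoding='utf-8') as f:
    f.write('Ħ')                  # OK
    # f.write('Ħ'.encode())       # TypeError: write() argument must be str, not bytes

# Binary mode: pass bytes that were encoded beforehand.
with open('demo.bin', 'wb') as f:
    f.write('Ħ'.encode('utf-8'))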

The char_data callback from the script looks like this (the line I changed is the one marked #*):

def char_data(data):
    """ Called by the parser when it meets character data """
    global cur_size, start
    wroteStart = False
    if start is not None:
        # The data belongs to an element, we should write the start part first
        cur_file.write('<%s%s>' % (start[0], attrs_s(start[1])))
        start = None
        wroteStart = True
    # ``escape`` is too much for us, only & and < need to be escaped here ...
    data = data.replace('&', '&amp;')
    data = data.replace('<', '&lt;')
    if data == '>':
        data = '&gt;'
    cur_file.write(data.encode('utf-8')) #*
    cur_size += len(data)
    if not wroteStart:
        # The data was outside of an element, it could be the right moment to
        # make the split
        next_file()

Any help would be greatly appreciated.

EDIT: added the traceback. The problem always occurs when trying to write to the file.

Traceback (most recent call last):
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 249, in <module>
    main(args[0], options.output_dir)
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 229, in main
    p.Parse(chunk)
  File "..\Modules\pyexpat.c", line 282, in CharacterData
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 180, in char_data
    cur_file.write(data)  # .encode('utf-8'))
  File "C:\Users\myself\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in position 6: character maps to <undefined>

Edit: I have tried replacing the offending characters in Notepad++, but another one, '\u200e', cropped up, so replacing characters one by one is not robust at all.
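Judging from the traceback, the write goes through cp1252.py, so the file being written to seems to be using the Windows default code page rather than UTF-8. That default can be checked like this (a sketch, run separately from the script):

import locale

# open() falls back to this when no encoding= argument is given in text mode;
# on my Windows box it prints 'cp1252'.
print(locale.getpreferredencoding(False))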

salvu
  • Can you include the traceback? – Josh Lee Aug 30 '17 at 14:57
  • If I interpret UTF-8 correctly, your character is not valid (but would be in UTF-16). In UTF-8 it would have to be represented as two-byte combination 0xC4 0xA6 , see [here](http://www.fileformat.info/info/unicode/char/0126/index.htm) – guidot Aug 30 '17 at 15:03
  • @guidot I tried to change to utf-16 but immediately got: `UnicodeError: UTF-16 stream does not start with BOM`. – salvu Aug 30 '17 at 15:22
  • @JoshLee Added traceback in edit above. – salvu Aug 30 '17 at 15:47

1 Answer


I have been a total noob. I modified the write-to-file command to use a try/except block that replaces any chunk of data containing an unwanted character with an empty string. I know the file will lose some information like this, but at least I can split it and look inside!

This is what I did:

try:
    cur_file.write(data)  # .encode('utf-8')) # this was part of the original line
except UnicodeEncodeError:
    data = ''
    cur_file.write(data)
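A less drastic variant (just a sketch; it assumes cur_file is a text-mode handle, so it exposes an .encoding attribute) would drop only the characters the target codec cannot represent instead of the whole chunk:

try:
    cur_file.write(data)
except UnicodeEncodeError:
    # Re-encode with errors='ignore' so only the unrepresentable
    # characters are dropped, not the entire chunk.
    enc = cur_file.encoding or 'ascii'
    cur_file.write(data.encode(enc, errors='ignore').decode(enc))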
salvu