How to fix inconsistent UnicodeEncodeError reading/opening file

Question

I've recently picked up Python and have inherited some Python 2.7 code in which I am trying to resolve some exceptions. For each command-line argument (a directory name), the main code calls an imported module to process the XML files in that directory. The code in question then initializes a class instance by reading lines from a "matches" file and building a dictionary that maps each "match" string to its line number in the file (this is used later to tag string matches found in the XML).

The matches file itself never changes; it's always the same file being loaded from disk. It's in UTF-8 format, about 12 thousand lines, 100 of which contain non-ASCII characters. Most of the time, this loading happens without a hitch. Occasionally, however, an exception is thrown from this code: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 12: ordinal not in range(128)". The character code and position also vary quite a bit. I've verified that the filename and file contents being read never change; it's just that occasionally, Python decides to throw exceptions for the same data it can usually read!

I noticed that these exceptions always seem to happen after the imported module reports errors processing an internal list, but how that makes a different module suddenly throw exceptions on previously-acceptable data is a mystery.

Here is the error I originally saw:

ProcessXML - entry 52 bad list string, skipping...
class StrMatcher.__init__ called
Traceback (most recent call last):
  File "/home/tests/my_main.py", line 19, in my_main
    obj = StrMatcher("/home/tests/match_strings.txt")
  File "/home/tests/my_common.py", line 20, in __init__
    for line in f:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 12: ordinal not in range(128)

From this code:

class StrMatcher:
    def __init__(self, matches_file):
        print("StrMatcher.__init__ called")
        self.line_numbers = {}
        i = 1
        with open(matches_file, 'r') as f:
            for line in f:
                self.line_numbers[line.strip()] = i
                i += 1

(Unlike other questions about Python Unicode errors, this error happens inconsistently despite always processing the exact same data.)

I printed out 'i', and saw that the error happens after the last line has been read from the file (so presumably it's trying to handle the EOF). I printed out the type of 'line', and noticed it was "<type 'str'>", so I tried changing the 'open' line:

        with io.open(matches_file, mode='r', encoding="utf-8") as f:

This changes 'line' into "<type 'unicode'>", and also changes the error output (for those cases with errors; otherwise, it works just as well as the previous code):

ProcessXML - entry 52 bad list string, skipping...
class StrMatcher.__init__ called
Traceback (most recent call last):
  File "/home/tests/my_main.py", line 19, in my_main
    obj = StrMatcher("/home/tests/match_strings.txt")
  File "/home/tests/my_common.py", line 19, in __init__
    with io.open(matches_file, mode='r', encoding="utf-8") as f:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 12: ordinal not in range(128)
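
For reference, the io.open variant can be exercised in isolation. The following is a minimal sketch, not the real module: the helper name `load_line_numbers` and the three-line sample file are hypothetical stand-ins for StrMatcher and the actual matches file. It only shows that decoding UTF-8 on read and building the dictionary involves no ASCII encoding step of its own.

```python
import io
import os
import tempfile


def load_line_numbers(matches_file):
    """Map each stripped line of a UTF-8 file to its 1-based line number."""
    line_numbers = {}
    # io.open decodes while reading, so every line is text
    # (type 'unicode' on Python 2, 'str' on Python 3).
    with io.open(matches_file, mode="r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line_numbers[line.strip()] = i
    return line_numbers


# Hypothetical sample standing in for the real matches file; it includes
# two of the non-ASCII characters reported in the errors (U+00F3, U+200B).
sample = u"plain\ncami\u00f3n\nzero\u200bwidth\n"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, mode="w", encoding="utf-8") as f:
        f.write(sample)
    line_numbers = load_line_numbers(path)
finally:
    os.remove(path)

print(sorted(line_numbers.values()))
```

If this runs cleanly on the same interpreter, that would suggest the loading code is not what performs the failing encode.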

Different input directories that exhibit the 'ProcessXML' error (and possibly different runs on the same directories; I'm not sure) change the character and position reported in the exception, e.g.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 43: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 91: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u200b' in position 52: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 48-50: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 37: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 32: ordinal not in range(128)

What is going on? What is the 'ascii' codec trying to encode, and why, particularly when the file is opened with encoding="utf-8"?

I assume that some state is inadvertently changed by the imported module in the error case, but I'm surprised that anything can make reading lines or opening files in another module throw a UnicodeEncodeError when the exact same file/data can be opened and read successfully by the same code in other circumstances! Is it a Python 2.7 bug? Or some global state in the 'open' code? Or...?
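
One way to narrow this down might be to look at the exception object itself. UnicodeEncodeError carries the standard attributes `encoding`, `object`, `start` and `end`, so the string that was being encoded can be inspected. The sketch below is only an illustration: the helper `describe_encode_error` and the sample text are hypothetical, and the error is triggered by an explicit `.encode("ascii")` call rather than by the real code.

```python
def describe_encode_error(exc):
    """Summarise what a UnicodeEncodeError was trying to encode."""
    return {
        "codec": exc.encoding,                         # e.g. 'ascii'
        "position": (exc.start, exc.end),              # end is exclusive
        "offending": exc.object[exc.start:exc.end],    # the failing character(s)
        "whole_string": exc.object,                    # the full text being encoded
    }


# Hypothetical text: 12 ASCII characters, then U+00F3 at index 12.
text = u"abcdefghijkl\u00f3 tail"

info = None
try:
    # Encoding text to ASCII bytes is the operation the message describes.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    info = describe_encode_error(exc)

print(repr(info["whole_string"]))
```

Wrapping the StrMatcher construction in a similar try/except and logging `exc.object` could show whether the string being encoded comes from the matches file at all or from somewhere else, such as the XML processing.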

Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) — Ulug Toprak, Apr 23 '19 at 08:40
@UlugToprak No, in this case the code isn't trying (AFAICT) to convert anything into a str type, and the problem doesn't happen consistently even though the data being read never changes. — barnabas, Apr 23 '19 at 08:56
