Python Reading File and Identifying Source of UnicodeDecodeError

Question

I am trying to read a text file using the following statement:

with open(inputFile) as fp:
    for line in fp:
        if len(line) > 0:
            lineRecords.append(line.strip())

The problem is that I get the following error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6880: character maps to <undefined>

My question is: how can I identify exactly where in the file the error is encountered? The position Python gives is tied to the location in the record being read at the time, not the absolute position in the file. So is it the 6,880th character in record 20 or the 6,880th character in record 2000? Without record information, the position value returned by Python is worthless.

Bottom line: is there a way to get Python to tell me what record it was processing at the time it encountered the error?

(And yes I know that 0x9d is a tab character and that I can do a search for that but that is not what I am after.)

Thanks.

Update: the post at UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function has nothing to do with the question I am asking - which is how can I get Python to tell me what record of the input file it was reading when it encountered the unicode error.

Possible duplicate of [UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function](https://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined) — razimbres, Mar 05 '19 at 22:38
Why don't you try to read the file in binary mode and print the chars at position 6875:6885 so you can see the bad char at position 6880 (from the output)? — patito, Mar 05 '19 at 22:49
I don't need to read the file in binary mode to find out what the character is as Python provides that information. My objective is to find out what record has the bad character. Without that information, the byte offset information that Python provides is utterly worthless. — Jim, Mar 06 '19 at 04:05
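For anyone who does want to look at the raw bytes as patito suggests, here is a minimal sketch. The helper name `bytes_around` and its `radius` parameter are made up for illustration, and it only helps if `offset` is an absolute byte offset into the file, which the position in the error message may not be.

```python
def bytes_around(path, offset, radius=5):
    """Return the raw bytes from offset - radius through offset + radius."""
    start = max(0, offset - radius)
    with open(path, "rb") as fp:  # binary mode: nothing is decoded
        fp.seek(start)
        return fp.read(offset + radius + 1 - start)
```

Calling `print(bytes_around(inputFile, 6880))` would show the bytes at absolute positions 6875 through 6885, with undecodable bytes displayed as escapes such as `\x9d`.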

score 2 · Answer 1 · answered Mar 05 '19 at 23:38

I think the only way is to track the line number separately and output it yourself.

with open(inputFile) as fp:
    num = 0
    try:
        for num, line in enumerate(fp):
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        # num is the zero-based index of the last line read before the error
        print('Line', num, e)

Mark Ransom

Hi Mark, thanks for this code. I was sure that it would work, but the output I get back is puzzling. On run 1 the output is: Line 9 'charmap' codec can't decode byte 0x9d in position 3649. So I deleted the lines before line 9 and ran the program again, and the output became: Line 0 'charmap' codec can't decode byte 0x9d in position 4490. I expected to see line 0, but I did not expect the position value to change. Note that line 0 has only 955 characters in it. It looks like "for line in fp" has nothing to do with reading records. – Jim Mar 06 '19 at 04:44
@Jim that's unfortunate. The only other thing I could suggest is to read the file in binary mode and decode it yourself, but that doesn't let you read it line-by-line. – Mark Ransom Mar 06 '19 at 05:01
@Jim this is a perfect example of [the law of leaky abstractions](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/). – Mark Ransom Mar 06 '19 at 05:03
Hi Mark, I am still puzzling over this. I tried using readlines(), but it behaves just the same as the enumerate() loop, which is reassuring in one sense: that's two different techniques producing the same output. I've got to figure that I'm missing something. – Jim Mar 06 '19 at 05:05
@Jim I think reading a text file in Python occurs in multiple pieces. First an internal buffer is filled from the file, then that buffer is decoded, and finally the decoded buffer is split into lines. The offset reported in the exception is relative to the start of an internal buffer that you can't see. – Mark Ransom Mar 06 '19 at 12:56
Hi Mark, I do believe that you are correct. I've also tried the old non-pythonic way of reading a file and it too has the same issue. It certainly does look like Python is reading more than a single line at a time and when it encounters a bad char within that block, it throws the error with the position being relative to the start of that particular block. Thanks for taking the time to help me out on this. – Jim Mar 07 '19 at 01:08
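One way around the hidden buffer is to open the file in binary mode and decode each record yourself, so the offset in the exception is relative to that record only. This is a sketch, not a definitive fix. The function name `find_undecodable_records` is hypothetical, and the `cp1252` default is an assumption: a 'charmap' error usually points to a Windows code page, but the actual default encoding on the asker's machine is not stated. It relies on iterating a binary file splitting on `b"\n"`, which suits single-byte and UTF-8 encodings but not UTF-16.

```python
def find_undecodable_records(path, encoding="cp1252"):
    """Return (record_number, offset_in_record, absolute_offset, byte_value)
    for the first undecodable byte of every record that fails to decode."""
    problems = []
    absolute = 0  # byte offset of the current record's first byte
    with open(path, "rb") as fp:  # binary mode: no implicit decoding
        for record_number, raw in enumerate(fp, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is relative to `raw`, i.e. to this record only
                problems.append(
                    (record_number, exc.start, absolute + exc.start, raw[exc.start])
                )
            absolute += len(raw)
    return problems
```

A possible usage, with `inputFile` taken from the question:

```python
for record, pos, absolute, value in find_undecodable_records(inputFile):
    print(f"record {record}: byte {value:#04x} at position {pos} "
          f"(absolute offset {absolute})")
```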

score 0 · Answer 2 · answered Mar 05 '19 at 22:51

You can use the read method of the file object to obtain the first 6880 characters, encode it, and the length of the resulting bytes object will be the index of the starting byte of the offending character:

with open(inputFile) as fp:
    print(len(fp.read(6880).encode()))

blhsing

But this solution offers no information as to what record contains the bad character. And the 6880 reference is only valid for that single record and not any other. – Jim Mar 06 '19 at 04:08

score 0 · Answer 3 · answered Mar 06 '19 at 04:10

I have faced this issue before, and the easiest fix is to open the file with UTF-8 encoding:

with open(inputFile, encoding="utf8") as fp:
    ...  # read the lines as before

Majd Msahel

That's not going to help if the file isn't actually UTF-8 encoded. The question was about identifying where the offending character is in the file, not trying to fix it blindly. – Mark Ransom Mar 06 '19 at 04:13
Hi Majd, I actually did that as well. In that case I wound up getting a different set of encoding errors. So whether I use the encoding option or not, what I'd like to get is an error message of the sort "on record N the character at position P is invalid." – Jim Mar 06 '19 at 04:53
What Python version are you using? – Majd Msahel Mar 06 '19 at 06:05
Python v3.5.2 though I doubt that it is a relevant factor in this situation. – Jim Mar 07 '19 at 01:05
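Since guessing an encoding just produces a different set of errors, it may help to check several candidates and see where each one first fails. The sketch below reads the whole file as bytes and decodes it in one call, so `exc.start` is an absolute byte offset that can be turned into a line number. The names `first_decode_error` and `check_encodings` and the candidate list are illustrative assumptions, and reading the whole file at once assumes it fits in memory.

```python
def first_decode_error(data, encoding):
    """Return None if `data` decodes cleanly, otherwise a tuple of
    (line_number, offset_in_line, absolute_offset) for the first bad byte."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Count the newlines before the bad byte to get a 1-based line number.
        line_number = data.count(b"\n", 0, exc.start) + 1
        # rfind returns -1 when there is no earlier newline, so +1 gives 0.
        line_start = data.rfind(b"\n", 0, exc.start) + 1
        return line_number, exc.start - line_start, exc.start
    return None


def check_encodings(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Map each candidate encoding to its first decode error (or None)."""
    with open(path, "rb") as fp:
        data = fp.read()
    return {name: first_decode_error(data, name) for name in candidates}
```

A `None` result only means the bytes are decodable, not that the encoding is the right one: `latin-1` maps every byte to a character, so it never fails.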
