Python 3: Why am I getting a UnicodeDecodeError, or is this a memory issue?

Question

I'm writing a program to iterate over my Robocopy log (>25 MB). It's far from finished, because I'm stuck on a problem.

The problem is that after iterating over ~1700 lines of my log, I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "C:/Users/xxxxxx.xxxxxx/SkyDrive/#Python/del_robo2.py", line 6, in <module>
    for line in data:
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7869: character maps to <undefined>

The program looks as follows:

x="Error"
y=1
arry = []
data = open("Ausstellungen.txt",mode="r")
for line in data:
    arry = line.split("\t")
    print(y)
    y=y+1
    if x in arry:
        print("found")
        print(line)
data.close()

If I reduce the txt file to 1000 lines, the program works. If I delete lines 1500 to 3000 and run it again, I still get the Unicode error around line 1700.

So have I made an error, or is this some memory limitation in Python?

You should pass the `encoding` argument to `open`(if you are on python3, in python2 use `codecs.open`). — Bakuriu, Jun 26 '13 at 13:05

score 1 · Accepted Answer · edited May 23 '17 at 12:17

Given your data and snippet, I would be surprised if this were a memory issue. It's more likely the encoding: Python is using your system's default encoding to read the file, which is "cp1252" (the default MS Windows encoding), but the file contains bytes that cannot be decoded in that encoding (0x81, according to your traceback). A candidate for the file's actual encoding might be "latin-1", which you can make Python 3 use by writing

open("Ausstellungen.txt",mode="r", encoding="latin-1")
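For illustration, here is a minimal sketch of how that call could fit into the loop from the question. The function name `find_error_lines` and its parameters are made up for this example. Keep in mind that latin-1 assigns a character to every possible byte, so it never raises a decode error; whether the resulting characters are the right ones still depends on the file's real encoding.

```python
def find_error_lines(path, needle="Error", encoding="latin-1"):
    """Return (line_number, line) pairs whose tab-separated fields contain needle.

    Hypothetical helper for illustration; 'latin-1' is only a candidate encoding.
    """
    hits = []
    # Passing encoding explicitly avoids falling back to the platform default
    # (cp1252 on the asker's machine). The with block closes the file for us.
    with open(path, mode="r", encoding=encoding) as data:
        for number, line in enumerate(data, start=1):
            # Strip the trailing newline so a needle in the last field still matches.
            fields = line.rstrip("\n").split("\t")
            if needle in fields:
                hits.append((number, line))
    return hits
```

It could be called as `find_error_lines("Ausstellungen.txt")` and the returned pairs printed one by one.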

A possibly similar issue is discussed in "Python 3 chokes on CP-1252/ANSI reading". A nice talk about the whole topic is here: http://nedbatchelder.com/text/unipain.html

THX a lot, setting the encoding to latin was in this case the right thing. My first Python Log checker program slowly gets substance :). — polYtoX, Jun 26 '13 at 17:05

score 0 · Answer 2 · edited Mar 20 '17 at 10:18

When a file is opened in text mode, Python decodes all file data to Unicode values. You didn't specify an encoding to use, so Python uses the default for your system, the cp1252 Windows Latin codepage.

However, that is the wrong encoding for your file data. You need to specify an explicit codec to use:

data = open("Ausstellungen.txt",mode="r", encoding='UTF8')

Exactly which encoding to use is unfortunately something you need to figure out yourself. I used UTF-8 as an example codec.
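One rough way to narrow it down is to read the raw bytes and try a few candidate codecs in turn. The sketch below is only that: the helper name `first_decodable` and the candidate list are assumptions (cp850 is a guess, since it is a common console codepage on Western European Windows and maps byte 0x81 to "ü"). A codec that decodes without error is not necessarily the correct one, so the decoded text should still be checked by eye.

```python
def first_decodable(raw, candidates=("utf-8", "cp1252", "cp850")):
    """Return the first codec name that decodes raw without error, else None.

    Hypothetical helper; the candidate list and its order are assumptions.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeError:
            # This codec hit a byte (sequence) it cannot decode; try the next one.
            continue
        return name
    return None
```

For the log in question, the bytes could be obtained with `open("Ausstellungen.txt", "rb").read()` and passed to the function.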

Be aware that some versions of RoboCopy have problems producing valid output.

If you don't yet know what Unicode is, or want to know about encodings, see:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

The reason you see the error crop up in different parts of your file is that your data contains more than one byte that the cp1252 codec cannot decode.
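To see where those bytes are, the file can be read in binary mode and each line decoded separately. This is a sketch under assumptions: the generator name `undecodable_lines` is invented here, and it reports only the first offending byte on each line.

```python
def undecodable_lines(raw, encoding="cp1252"):
    """Yield (line_number, offset_in_line, byte_value) for lines that fail to decode.

    Hypothetical helper; reports only the first bad byte found on each line.
    """
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte the codec rejected.
            yield number, exc.start, line[exc.start]
```

Feeding it the result of `open("Ausstellungen.txt", "rb").read()` would list every line that cp1252 cannot handle, which also shows why removing one block of lines does not make the error go away.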

THX for the wealth of information. Choosing UTF8 didn't work for me, but you clearly wrote that maybe the log isn't in the right format. — polYtoX, Jun 26 '13 at 17:10
@polYtoX: If you used `/unilog`, then the output would (should) be in UTF16. — Martijn Pieters, Jun 26 '13 at 17:13
