
I'm writing a program to iterate over my Robocopy log (>25 MB). It's far from finished, because I'm stuck on a problem.

The problem is that after iterating over ~1700 lines of my log, I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "C:/Users/xxxxxx.xxxxxx/SkyDrive/#Python/del_robo2.py", line 6, in <module>
    for line in data:
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7869: character maps to <undefined>

The program looks as follows:

x="Error"
y=1
arry = []
data = open("Ausstellungen.txt",mode="r")
for line in data:
    arry = line.split("\t")
    print(y)
    y=y+1
    if x in arry:
        print("found")
        print(line)
data.close()   

If I reduce the txt file to 1000 lines, the program works. If I delete lines 1500 to 3000 and run it again, I get the Unicode error again around line 1700.

So have I made an error, or is this some memory limit of Python?

polYtoX

2 Answers


Given your data and snippet, I would be surprised if this were a memory issue. It's more likely the encoding: Python is using your system's default encoding to read the file, which is "cp1252" (the default MS Windows encoding), but the file contains bytes that cannot be decoded in that encoding. A candidate for the file's actual encoding might be "latin-1", which you can tell Python 3 to use with

open("Ausstellungen.txt",mode="r", encoding="latin-1")
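To see the difference between the two codecs without touching the real log, here is a minimal sketch. The sample bytes are made up for illustration; only the 0x81 byte is taken from the traceback in the question.

```python
# Made-up sample line containing byte 0x81, the byte named in the traceback.
raw = b"Error\tfile\x81name\n"

# cp1252 leaves 0x81 undefined, so decoding raises UnicodeDecodeError.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# latin-1 maps every byte 0x00-0xFF to a code point, so it never raises.
# It can still produce the wrong characters if the file is not really latin-1.
text = raw.decode("latin-1")
fields = text.rstrip("\n").split("\t")

print(cp1252_ok, fields)
```

Because latin-1 accepts any byte, the absence of an error does not prove that it is the file's true encoding; it is worth looking at a few decoded lines that contain non-ASCII characters.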

A possibly similar issue is "Python 3 chokes on CP-1252/ANSI reading". A nice talk about the whole thing is here: http://nedbatchelder.com/text/unipain.html

Hans
Thanks a lot, setting the encoding to latin was the right thing in this case. My first Python log checker program is slowly taking shape :). – polYtoX Jun 26 '13 at 17:05

Python decodes all file data to Unicode values. You didn't specify an encoding to use, so Python uses the default for your system, the cp1252 Windows Latin codepage.

However, that is the wrong encoding for your file data. You need to specify an explicit codec to use:

data = open("Ausstellungen.txt",mode="r", encoding='UTF8')

Exactly which encoding to use is unfortunately something you need to figure out yourself. I used UTF-8 only as an example codec.
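One rough way to narrow it down is to read a chunk of the file in binary mode and try a few candidate codecs in turn. The sketch below is only an illustration: `guess_encoding` is a hypothetical helper, the candidate list is an assumption (cp850 is included because it is a common Western European console codepage, not because the log is known to use it), and the sample bytes are made up.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "cp850", "latin-1")):
    """Return the first candidate codec that decodes raw without an error.

    A successful decode only means the bytes are valid in that codec,
    not that the resulting text is what the writer intended.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample: in cp850 the letter "ü" is stored as byte 0x81.
sample = "Müller\tError\n".encode("cp850")
print(guess_encoding(sample))

# With a real file you would pass in bytes read in binary mode, e.g.:
# with open("Ausstellungen.txt", "rb") as f:
#     print(guess_encoding(f.read()))
```

Order matters here: strict codecs such as UTF-8 should come first, and latin-1 last, since latin-1 accepts every byte and would otherwise always win.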

Be aware that some versions of RoboCopy have problems producing valid output.

If you don't yet know what Unicode is, or want to know about encodings, see:

The reason you see the error crop up in different parts of your file is that your data contains more than one byte that the cp1252 codec cannot decode.
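If you want to check where those bytes sit, you can scan the raw data for every byte value the codec rejects. This is a minimal sketch that only makes sense for single-byte codecs such as cp1252; the helper names and the sample bytes are made up for illustration.

```python
def undefined_bytes(encoding="cp1252"):
    """Return the set of single byte values the codec refuses to decode."""
    bad = set()
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.add(value)
    return bad


def find_bad_offsets(raw, encoding="cp1252"):
    """Return (offset, byte value) pairs for every undecodable byte in raw."""
    bad = undefined_bytes(encoding)
    return [(offset, value) for offset, value in enumerate(raw) if value in bad]


# Made-up sample with two problem bytes on different lines.
sample = b"ok line\nM\x81ller\nfine\nstra\x8dfe\n"
for offset, value in find_bad_offsets(sample):
    print(offset, hex(value))

# With a real file, read it in binary mode first:
# with open("Ausstellungen.txt", "rb") as f:
#     hits = find_bad_offsets(f.read())
```

Removing a block of lines from the log only removes some of these bytes, so the decode error simply moves on to the next one.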

Martijn Pieters