
Hello, I'm working through the O'Reilly book by Allen Downey to learn Python 3.x. Chapter 9 has an example that works with a word list stored in a file from the Moby Project.

https://en.wikipedia.org/wiki/Moby_Project

https://web.archive.org/web/20170930060409/http://icon.shef.ac.uk/Moby/

I read the german.txt file with the following Python code:

with open("german.txt") as log:
    for line in log:
        word = line.strip()
        if len(word) > 20:
            print(word)


Some words are printed, but then it breaks and I get these lines:

Amtsueberschreitungen
Traceback (most recent call last):
  File "einlesen.py", line 8, in <module>
    for line in log:
  File "/home/alexander/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 394: invalid start byte

Which character is meant? How can I handle this in the Python code?

Thanks

snakecharmerb
3 Answers


According to the documentation of open():

if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

So how the file is read differs from system to system. To guarantee that the file is read correctly, you need to specify the correct encoding explicitly.
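You can check what your own platform default is with the same call the documentation mentions; a quick sketch:

```python
import locale

# The fallback encoding that open() uses when no encoding is given.
# The result depends on the operating system and locale settings.
print(locale.getpreferredencoding(False))
```

On a typical Linux or macOS system with UTF-8 locales this prints something like `UTF-8`, which is why the file was being decoded as UTF-8 and failed on byte 0x82.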

According to the documentation of the Moby Project on Wikipedia, "some non-ASCII accented characters remain, represented using Mac OS Roman encoding". In the documentation of the Python codecs module you can find the correct name for that codec, which is 'mac_roman'. So, you could use the following code, which does not result in a decoding error:

with open("german.txt", 'rt', encoding='mac_roman') as log:
    for line in log:
        word = line.strip()
        if len(word) > 20:
            print(word)

UPDATE

Despite the documentation, the file does not seem to be encoded using Mac OS Roman encoding. I decoded the file using all possible encodings and compared the results. There are only 9 non-ASCII words in the list, and the word "André" seems right, as pointed out in another answer. The following is a list of possible encodings (that did not fail, and included the word "André") and the 9 non-ASCII words decoded according to that encoding:

encodings: cp437, cp860, cp861, cp863, cp865
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p≥ange

encodings: cp720
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, pٌange

encodings: cp775
words: André, Attaché, Chāteau, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhōnetal, p“ange

encodings: cp850, cp858
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p‗ange

encodings: cp852
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p˛ange

For all the above-mentioned encodings, the first 8 words decode identically; only the last word differs between them.

Based on these results, I think the cp720 encoding was used. However, I don't recognize the last word in the list, so I can't say for sure. It's up to you to decide which encoding is most suitable for you.

wovano
    +1 thorough, and demonstrating how difficult it can be to accurately determine _exactly_ which 8-bit encoding has been used. – snakecharmerb Jun 29 '19 at 20:21

As per my comment above, this looks like an encoding issue, and testing confirms that it is.

Detecting the encoding by using the chardet module gives:

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

Which is not Python's default UTF-8 encoding. To read files with other encodings, you need to specify the desired encoding via the encoding= parameter of the open(filename, mode, encoding, ...) function.

Since the encoding might not be known in advance, it is quite handy to use chardet's UniversalDetector to determine the file encoding and then pass it when opening the file, like this:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
detector.reset()
with open('german.txt', 'rb') as file:
    for line in file:
        detector.feed(line)
        if detector.done:
            break
detector.close()
result = detector.result
print(result)
# detector.result is a dict; open() needs just the encoding name
encoding = result['encoding']

with open("german.txt", encoding=encoding) as log:
    for line_num, line in enumerate(log):
        word = line.strip()
        if len(word) > 20:
            print(line_num, word)

Note: This works fine on my machine with German locales (macOS 10.10.5 with Python 3.6.2), and before detecting the encoding I got the same error as the OP. My locales are:

LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
albert
    +1 for using chardet, although in this case I think chardet is right not to be 100% confident about its findings. – snakecharmerb Jun 29 '19 at 15:57
    It's still guessing. And since only 9 of the 159809 words are non-ASCII words, it might be easy to get a high confidence, even with an incorrect encoding. If you look at the actual file, you'll see that decoding it with Windows-1252 will result in the following 9 non-ASCII words: Andr‚, Attach‚, Chƒteau, Conf‚rencier, C‚zanne, Faberg‚, L‚vi-Strauss, Rh“netal, pòange. This does not seem correct to me! – wovano Jun 29 '19 at 16:49

Guessing the correct encoding for a file can be tricky. Let's start by opening the file in binary mode, finding the offending byte, and examining the surrounding characters.

>>> with open('german.txt', 'rb') as f:
...     bs = f.read()
... 
>>> bs.find(b'\x82')
24970
>>> bs[24960:24980]
b'nebel\rAndr\x82\rAndy\rAne'

So the byte b'\x82' is the final letter in a five-letter word that starts with 'Andr'.

Looking up b'\x82' on this page (by Stack Overflow user @tripleee), we can see which characters it might correspond to. The most likely match, I think, is 'é', giving us the proper name 'André'. Cross-checking against the list of Python encodings, the most suitable encoding is cp850, a legacy encoding for Western European languages.

This code will read the file without error:

>>> with open('german.txt', encoding='cp850') as f:
...     for line in f:
...         pass  # do things with line

If you find any "unusual" characters in the data, you might need to try alternative encodings, because an 8-bit encoding can decode a byte successfully while producing a meaningless result. For example, if we decode from cp1252:

>>> b'Andr\x82'.decode('cp1252')
'Andr‚'
snakecharmerb