According to the documentation of open(): "if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding."
So how the file is decoded can differ from one machine to the next. To guarantee that the file is read correctly, you need to specify the correct encoding explicitly.
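To see which default applies on a given machine, you can call the same function that the quoted documentation names. This is only a minimal check, and the value printed depends on the platform and locale settings:

```python
import locale

# The encoding open() falls back to when no encoding argument is given,
# according to the documentation quoted above.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints different values on two machines, the same open() call without an encoding argument can decode the same file differently on each.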
According to the Wikipedia article on the Moby Project, "some non-ASCII accented characters remain, represented using Mac OS Roman encoding". In the documentation of the Python codecs module you can find the correct name for that codec, which is 'mac_roman'. So you could use the following code, which does not result in a decoding error:
    with open("german.txt", 'rt', encoding='mac_roman') as log:
        for line in log:
            # Drop the trailing newline and any surrounding whitespace.
            word = line.strip()
            if len(word) > 20:
                print(word)
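If you want to confirm that a codec name is one Python recognizes before using it, codecs.lookup() raises LookupError for unknown names. The sketch below also checks 'macroman', which I believe is a registered alias of the same codec:

```python
import codecs

# Resolve the codec by the name given in the codecs documentation.
info = codecs.lookup("mac_roman")
print(info.name)

# An alias should resolve to the same codec.
alias_info = codecs.lookup("macroman")
print(alias_info.name == info.name)
```

Note that a successful lookup only shows that the codec exists, not that the file was actually written with it.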
UPDATE
Despite the documentation, the file does not seem to be encoded using Mac OS Roman encoding. I decoded the file using all possible encodings and compared the results. There are only 9 non-ASCII words in the list, and the word "André" seems right, as pointed out in another answer. The following is a list of possible encodings (that did not fail, and included the word "André") and the 9 non-ASCII words decoded according to that encoding:
encodings: cp437, cp860, cp861, cp863, cp865
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p≥ange
encodings: cp720
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, pٌange
encodings: cp775
words: André, Attaché, Chāteau, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhōnetal, p“ange
encodings: cp850, cp858
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p‗ange
encodings: cp852
words: André, Attaché, Château, Conférencier, Cézanne, Fabergé, Lévi-Strauss, Rhônetal, p˛ange
For all the above-mentioned encodings, the first 8 words are the same when decoded, apart from the "â" and "ô" under cp775. Only the last word differs, with a different result for each group of encodings listed.
Based on these results, I think that the cp720 encoding was used. However, I don't recognize the last word in the list, so I can't tell for sure. It's up to you to decide which decoding is most suitable for you.
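The brute-force comparison described above can be sketched roughly as follows. This is not the exact script I used: the helper name group_codecs_by_result is made up for illustration, the candidate list is taken from the modules in Python's encodings package (an assumption about where to find "all possible encodings"), and the sample bytes are a stand-in for the real file:

```python
import encodings
import pkgutil
from collections import defaultdict


def group_codecs_by_result(raw, must_contain):
    """Decode raw with every stdlib codec; group codec names by decoded text."""
    groups = defaultdict(list)
    # Assumption: every module in the encodings package is a candidate codec.
    for module in pkgutil.iter_modules(encodings.__path__):
        try:
            text = raw.decode(module.name)
        except (UnicodeError, LookupError):
            # Invalid bytes for this codec, not a text codec,
            # or not available on this platform.
            continue
        if must_contain in text:
            groups[text].append(module.name)
    return dict(groups)


# Stand-in data; for the real file use open("german.txt", "rb").read().
sample = "André\nChâteau\n".encode("cp437")
groups = group_codecs_by_result(sample, "André")
for text, names in groups.items():
    print(sorted(names), text.split())
```

Codecs that produce identical text end up in the same group, so the output has the same shape as the encodings/words list above.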