'utf-8' codec can't decode byte 0x80

Question

I'm trying to download a BVLC-trained model and I'm stuck with this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte

I think it's because of the following function (complete code)

  # Closure-d function for checking SHA1.
  def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
      with open(filename, 'r') as f:
          return hashlib.sha1(f.read()).hexdigest() == sha1

Any idea how to fix this?

The error message is quite clear. Either your file is not UTF8 at all, or it is damaged. — Jongware, Apr 24 '16 at 16:49
That is what I got when I try to print `f` `<_io.TextIOWrapper name='models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel' mode='r' encoding='utf8'>` — Ehab AlBadawy, Apr 24 '16 at 16:51
Interesting. So what happens when you specify the file encoding explicitly? Something like `open(filename, 'r', encoding='utf8')` ? — Phil Cote, Apr 24 '16 at 16:53
I tried changing the 2nd line to `with open(filename, 'r', encoding='utf8') as f:` but I got the same error. — Ehab AlBadawy, Apr 24 '16 at 16:55
No, do *not* tell Python it is UTF8. Unless you are sure it ought to be - but Python is telling you it is *not* valid UTF8, but something else. Open the file with a good code editor and see what's inside. — Jongware, Apr 24 '16 at 16:57
I can't. It's a .caffemodel file and it looks very messy when I open it in vim. I don't think it's supposed to be a human-readable file. — Ehab AlBadawy, Apr 24 '16 at 17:06
Glad you found both answers helpful! Note that you can only mark one as accepted; the choice is entirely up to you which one you pick. :-) — Martijn Pieters, Apr 24 '16 at 17:11
Yeah, thanks Martijn, I've up-voted your answer as it is what I'm looking for as well, but I found DSM post it first. Thanks anyway for your help :) — Ehab AlBadawy, Apr 24 '16 at 17:17
I stand corrected, Rax. My apologies. It wasn't my intention to mislead. — Phil Cote, Apr 24 '16 at 17:18

score 18 · Accepted Answer · answered Apr 24 '16 at 17:02

You are opening a file that is not UTF-8 encoded, while the default encoding for your system is set to UTF-8.

Since you are calculating a SHA1 hash, you should read the data as binary instead. The hashlib functions require you pass in bytes:

with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1

Note the addition of b in the file mode.

See the open() documentation:

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)

and from the hashlib module documentation:

You can now feed this object with bytes-like objects (normally bytes) using the update() method.
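As a rough illustration of the update() method quoted above, here is a minimal sketch that hashes a file in binary chunks, which avoids loading a large model file into memory at once. The helper name `sha1_of_file` and the `chunk_size` default are assumptions made for this example, not part of the original script.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha1()
    # 'rb' yields raw bytes, so no text decoding is attempted.
    with open(path, 'rb') as f:
        # iter() with a sentinel stops once read() returns b'' at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

With a helper like this, the check in the question could presumably become `sha1_of_file(filename) == sha1`.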

score 5 · Answer 2 · answered Apr 24 '16 at 17:01

You didn't open the file in binary mode, so f.read() tries to decode it as UTF-8 text, and that fails because the contents are not valid UTF-8. But since we take the hash of bytes, not of strings, it doesn't matter what the encoding is, or even whether the file is text at all: just open it and read it as a binary file.

>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte

but

>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
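The same behaviour can be reproduced without a model file. The sketch below writes a few arbitrary bytes starting with 0x80 to a temporary file, then reads it back in both modes. The payload and variable names are made up for this example.

```python
import hashlib
import os
import tempfile

# 0x80 is a UTF-8 continuation byte, so it can never start a character.
payload = b'\x80\x02binary payload'

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
path = tmp.name

# Text mode: Python tries to decode the bytes as UTF-8 and fails.
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode: the raw bytes go straight to hashlib, no decoding involved.
with open(path, 'rb') as f:
    binary_digest = hashlib.sha1(f.read()).hexdigest()

os.remove(path)
```

After this runs, `text_mode_error` holds the UnicodeDecodeError raised in text mode, while `binary_digest` is the SHA-1 of the raw payload.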

score 3 · Answer 3 · answered May 13 '17 at 10:14

I found no hint of this in the documentation or the source code, so I can't say why, but opening the file with the b flag (I guess for binary) works (TensorFlow version 1.1.0):

image_data = tf.gfile.FastGFile(filename, 'rb').read()

For more information, check out: gfile
