
I get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte

When I try to call codecs.decode(X, 'utf-8') where X = b'\xe8\xd0\xca@\xee\xe4\xca\xc6\xd6@\xde\xcc@\xe8\xd0\xca@\xd0\xca\xe6\xe0\xca\xe4\xea\xe6\x14\xc4\xf2@\xd0\xca\xdc\xe4\xf2@\xee\xc2\xc8\xe6\xee\xde\xe4\xe8\xd0@\xd8\xde\xdc\xce\xcc\xca\xd8\xd8\xde\xee\x14\x14\xd2\xe8@\xee\xc2\xe6@\xe8\xd0\xca@\xe6\xc6\xd0\xde\xde\xdc\xca\xe4@\xd0\xca\xe6\xe0\xca\xe4\xea\xe6\x14@@@@@@\xe8\xd0\xc2\xe8@\xe6\xc2\xd2\xd8\xca\xc8@\xe8\xd0\xca@\xee\xd2\xdc\xe8\xe4\xf2@\xe6\xca\xc2\x14\xc2\xdc\xc8@\xe8\xd0\xca@\xe6\xd6\xd2\xe0\xe0\xca\xe4@\xd0\xc2\xc8@\xe8\xc2\xd6\xca\xdc@\xd0\xd2\xe6@\xd8\xd2\xe8\xe8\xd8\xca@\xc8\xc2\xea\xce\xd0\xe8\xca\xe4\x14@@@@@@\xe8\xde@\xc4\xca\xc2\xe4@\xd0\xd2\xda@\xc6\xde\xda\xe0\xc2\xdc\xf2\\\x14\x14\xc4\xd8\xea\xca@\xee\xca\xe4\xca@\xd0\xca\xe4@\xca\xf2\xca\xe6@\xc2\xe6@\xe8\xd0\xca@\xcc\xc2\xd2\xe4\xf2Z\xcc\xd8\xc2\xf0\x14@@@@@@\xd0\xca\xe4@\xc6\xd0\xca\xca\xd6\xe6@\xd8\xd2\xd6\xca@\xe8\xd0\xca@\xc8\xc2\xee\xdc@\xde\xcc@\xc8\xc2\xf2\x14\xc2\xdc\xc8@\xd0\xca\xe4@\xc4\xde\xe6\xde\xda@\xee\xd0\xd2\xe8\xca@\xc2\xe6@\xe8\xd0\xca@\xd0\xc2\xee\xe8\xd0\xde\xe4\xdc@\xc4\xea\xc8\xe6\x14@@@@@@\xe8\xd0\xc2\xe8@\xde\xe0\xca@\xd2\xdc@\xe8\xd0\xca@\xda\xde\xdc\xe8\xd0@\xde\xcc@\xda\xc2\xf2\\\x14\x14\xe8\xd0\xca@\xe6\xd6\xd2\xe0\xe0\xca\xe4@\xd0\xca@\xe6\xe8\xde\xde\xc8@\xc4\xca\xe6\xd2\xc8\xca@\xe8\xd0\xca@\xd0\xca\xd8\xda\x14@@@@@@\xd0\xd2\xe6@\xe0\xd2\xe0\xca@\xee\xc2\xe6@\xd2\xdc@\xd0\xd2\xe6@\xda\xde\xea\xe8\xd0\x14\xc2\xdc\xc8@\xd0\xca@\xee\xc2\xe8\xc6\xd0\xca\xc8@\xd0\xde\xee@\xe8\xd0\xca@\xec\xca\xca\xe4\xd2\xdc\xce@\xcc\xd8\xc2\xee@\xc8\xd2\xc8@\xc4\xd8\xde\xee\x14@@@@@@\xe8\xd0\xca@\xe6\xda\xde\xd6\xca@\xdc\xde\xee@\xee\xca\xe6\xe8@\xdc\xde\xee@\xe6\xde\xea\xe8\xd0\\\x14\x14\xe8\xd0\xca\xdc@\xea\xe0@\xc2\xdc\xc8@\xe6\xe0\xc2\xd6\xca@\xc2\xdc@\xde\xd8\xc8@\xe6\xc2\xd2\xd8\xde\xe4\x14@@@@@@\xd0\xc2\xc8@\xe6\xc2\xd2\xd8\xca\xc8@\xe8\xde@\xe8\xd0\xca@\xe6\xe0\xc2\xdc\xd2\xe6\xd0@\xda\xc2\xd2\xdc\x14\xd2@\xe0\xe4\xc2\xf2@\xe8\xd0\xca\xca@\xe0\xea\xe8@\xd2\xdc\xe8\xde@
\xf2\xde\xdc\xc8\xca\xe4@\xe0\xde\xe4\xe8\x14@@@@@@\xcc\xde\xe4@\xd2@\xcc\xca\xc2\xe4@\xc2@\xd0\xea\xe4\xe4\xd2\xc6\xc2\xdc\xca\\\x14\x14\xd8\xc2\xe6\xe8@\xdc\xd2\xce\xd0\xe8@\xe8\xd0\xca@\xda\xde\xde\xdc@\xd0\xc2\xc8@\xc2@\xce\xde\xd8\xc8\xca\xdc@\xe4\xd2\xdc\xce\x14@@@@@@\xc2\xdc\xc8@\xe8\xdeZ\xdc\xd2\xce\xd0\xe8@\xdc\xde@\xda\xde\xde\xdc@\xee\xca@\xe6\xca\xca\x14\xe8\xd0\xca@\xe6\xd6\xd2\xe0\xe0\xca\xe4@\xd0\xca@\xc4'

I also tried binascii.unhexlify('%x' % (int('0b' + bNum, 2))).decode('utf-8'), where bNum is a long binary string, and got the same error.

The text originally came from a UTF-8-encoded .txt file.

EDIT: Let's say we have two bit strings. The first is the exact bit string produced by converting some text to bits. The second is extracted from an image. The second is exactly the same as the first up to the point where it was cut off, because the image it was hidden in didn't have enough pixels.

example: http://pastebin.com/NnaH9dEb

Why would it throw UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte if they both contain the same data up to the point where the second one cuts off?

EDIT2: When I convert the two bit strings to hex via hex(int(<var name>, 2)) I get different results, but converting only the first couple of bytes gives the same result for both.

Darrel Holt
  • I would challenge your assumption that the source was UTF-8 encoded. – Mark Ransom Apr 25 '16 at 03:41
  • @MarkRansom I checked and re-saved it with Notepad++ before and I just did it again, I still have the same problem. – Darrel Holt Apr 25 '16 at 04:05
  • @MarkRansom here's the code to my program if you would like to take a look: http://pastebin.com/ZibMjms3 It hides the text in an image. Then I get this error when I'm trying to retrieve it, so maybe it's my hide function causing the issue. The problem only arises when the text to put into the image exceeds the size of the image (not enough pixels to place each bit in to later reconstruct the text). With pure utf-8 text like Russian it works fine and just cuts off what couldn't fit into the image, but with regular English characters it gives me this error. – Darrel Holt Apr 25 '16 at 04:12
  • `\xe8\xd0`. Nope, not UTF-8. The highest valid continuation byte after 0xe8 is 0xbf, and that encodes characters deep into CJK Unified Ideographs. – Ignacio Vazquez-Abrams Apr 25 '16 at 05:29
  • @IgnacioVazquez-Abrams that hexcode above is `decMsg` from http://pastebin.com/NnaH9dEb. If I run : `for i in range(16,len(initMsg),16): print('\ninit: ' + hex(int(initMsg[0:i], 2))) print('dec: ' + hex(int(decMsg[0:i], 2))) print('\ninit: ' + hex(int(initMsg[:-i], 2)))` then the second `initMsg` prints correctly, but the first one prints exactly like `decMsg`. Why is this? – Darrel Holt Apr 25 '16 at 05:38

1 Answer


The bits in decMsg are misaligned relative to byte boundaries, so every decoded byte comes out shifted by one bit. If I add 7 zero bits to the end of the message or truncate the last bit, it decodes with my method. I did not go through your code (TL;DR).

import math

initMsg = '11101000110100001100101...'  # truncated due to post limits.
decMsg = '11101000110100001100101...'

# Only printing the first 25 chars of the message for brevity:

a = int(initMsg,2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

a = int(decMsg,2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

a = int(decMsg+'0000000',2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

a = int(decMsg[:-1],2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

Output:

b'the wreck of the hesperus'
b'\xe8\xd0\xca@\xee\xe4\xca\xc6\xd6@\xde\xcc@\xe8\xd0\xca@\xd0\xca\xe6\xe0\xca\xe4\xea\xe6'
b'the wreck of the hesperus'
b'the wreck of the hesperus'

Compare \xe8 to t in binary; \xe8 is the bit pattern of t shifted left by one bit:

>>> format(ord('t'),'08b')
'01110100'
>>> format(0xe8,'08b')
'11101000' 
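
One way to avoid the misalignment altogether is to write every byte as exactly 8 bits when building the bit string, and to drop any incomplete trailing byte before decoding. The following is only a minimal sketch, not the code from the question: the helper names text_to_bits and bits_to_text are made up for illustration, and it assumes the hidden message is UTF-8 text that may be cut off at an arbitrary bit.

```python
def text_to_bits(text, encoding="utf-8"):
    """Encode text as a bit string, always 8 bits per byte (leading zeros kept)."""
    return "".join(format(byte, "08b") for byte in text.encode(encoding))


def bits_to_text(bits, encoding="utf-8"):
    """Decode a possibly truncated bit string produced by text_to_bits."""
    usable = len(bits) - len(bits) % 8  # drop a trailing partial byte, if any
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    # errors="ignore" discards a multi-byte character that was cut off at the end
    return data.decode(encoding, errors="ignore")


bits = text_to_bits("the wreck of the hesperus")
# 43 bits = 5 whole bytes plus 3 stray bits, as if the image ran out of pixels
print(bits_to_text(bits[:43]))  # the w
```

Because the bit string is never passed through int(), leading zero bits are not lost, so a truncated message stays aligned on byte boundaries.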
Mark Tolonen
  • Thanks a ton Mark, this really cleared it up for me. – Darrel Holt Apr 25 '16 at 07:44
  • @DarrelHolt related: [Convert binary to ASCII and vice versa](http://stackoverflow.com/q/7396849/4279) – jfs Apr 25 '16 at 08:10
  • @J.F.Sebastian Thanks Sebastian, I saw your post on there about support for all Unicode characters, it looks really good. – Darrel Holt Apr 25 '16 at 08:19
  • @DarrelHolt to shift everything by 1 bit without converting to a big binary number: `y=''.join(chr(((ord(hi)&0x01)<<7) | (ord(lo)>>1)) for hi,lo in zip('\0'+x,x))` – Mark Ransom Apr 25 '16 at 14:24
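
The one-liner in the last comment works on a str of byte values. A rough Python 3 equivalent for a bytes object might look like the sketch below. The name shift_right_one_bit is hypothetical, and the sketch assumes the data is off by exactly one bit, as it is in this question.

```python
def shift_right_one_bit(data):
    """Shift a whole bytes object right by one bit, carrying bits across bytes."""
    return bytes(
        ((hi & 0x01) << 7) | (lo >> 1)  # low bit of the previous byte becomes the new top bit
        for hi, lo in zip(b"\x00" + data, data)
    )


# First 9 bytes of the misaligned data from the question
print(shift_right_one_bit(b"\xe8\xd0\xca@\xee\xe4\xca\xc6\xd6"))  # b'the wreck'
```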