Python problems encoding and decoding in UTF-8

Question

I am using Python 3. I read a file into memory as bytes and assign it to a variable, then convert the binary data to a string with:

def to_str(bytes_or_str):
  if isinstance(bytes_or_str, bytes):
    value = bytes_or_str.decode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

I do this because I want to edit the file and replace some of its characters using a list I made containing the first 256 characters, chr(0) through chr(255).

Once the loaded file variable is edited, I then rewrite the file as bytes with:

def to_bytes(bytes_or_str):
  if isinstance(bytes_or_str, str):
    value = bytes_or_str.encode('utf-8', 'replace')
  else:
    value = bytes_or_str
  return value

It works great, as long as I only use ASCII characters. I can use latin-1 instead of utf-8 and it works for the first 256 characters, but beyond that the encoding and decoding methods break. Latin-1 uses a single byte per character and only covers 256 characters, which I am guessing is why it works up to but not beyond 256. I would like to use utf-8 because it covers a broader range of characters, but it fails with my two encode/decode methods above, and data gets lost if I use characters that aren't ASCII. Is this problem caused by the fact that utf-8 uses more than one byte for characters from chr(128) upward, or by something else? Do I need to use something like the pack() method to isolate characters that use more than one byte? With this function I can find how many bytes a character takes in UTF-8:

def utf8len(x):
    return len(x.encode('utf-8'))

If the data loss is caused by characters that take more than one byte, maybe I can use this somehow? Does anyone have any other ideas? Thanks for any help.

Also: Let's say I convert the character 'Ω' to bytes, which reads as b'\xe2\x84\xa6' in the Python console. How exactly does this work if each character in bytes is a set of more characters? When I convert a character to bytes, Python displays it as characters and not 0's and 1's. Aren't bytes 0's and 1's? I don't know what Python is doing here.

I made this code to try to explain how it works but I still don't completely understand:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def string2bits(s=''):
    return [bin(ord(x))[2:].zfill(8) for x in s]

def bits2string(b=None):
    return ''.join([chr(int(x, 2)) for x in b])

def utf8len(x):
    return len(x.encode('utf-8'))

def latin1len(x):
    return len(x.encode('latin-1'))

char_num = 255
def_char = chr(char_num)

char = def_char
bit = string2bits(char)
char2 = bits2string(bit)

print('\nString:')
print(char2)

print('\nUTF-8 byte Len:')
print(utf8len(char))
# I had to add this next if statement because:
#  LATIN-1 can't encode character '\u0100' in position 0: ordinal not in range(256)
if char_num < 256:
    print('\nLatin-1 byte Len:')
    print(latin1len(char))

print('\nList of Bits:')
for x in bit:
    print(x)

With the # coding comment at the beginning of the code, I can change the script encoding between utf-8 and latin-1, and I can change the char_num variable to see the string of bits for that character in each encoding. But if char_num is above 255, latin-1 gives the error: UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)

If I hard code the encoding from utf-8 to latin-1 with:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

Shouldn't this code display the bits of the def_char for latin-1 encoding? How does Python work here?

Do you have any sample strings that give you good vs bad results? — JacobIRR, Sep 10 '17 at 17:11
So if I change bytes at the beginning of a Jpeg file: These are characters before:((( ÿØÿà JFIF x x ÿá "Exif MM *))) These are the same characters after: (((�� JFIF x x �� "Exif MM *))) If I used latin-1 below 256 I would have no problems because I am not replacing characters in the file that aren't within chr(0) and chr(256) — Infinity Loop, Sep 10 '17 at 17:19
A jpeg is a binary file, you can't decode it to UTF-8 because it isn't encoded in UTF-8. You should edit binary files as binary, not Unicode strings. — Mark Tolonen, Sep 10 '17 at 18:48
Well apparently I can decode any picture file (PNG, BMP, JPEG) with Latin-1 and match any character in them that are below 256 so it must be possible to match it with UTF-8 — Infinity Loop, Sep 10 '17 at 19:09
Latin1 can decode anything since it just translates bytes 0-255 to Unicode codepoints 0-255. It doesn't mean it is the correct thing to do. Leave it binary and edit it in binary. You can only decode UTF-8 if the binary data was encoded with UTF-8 to begin with (other than converting everything that doesn't make sense to the decoder with `'replace'`, but then you get a bunch of question marks, as you found). — Mark Tolonen, Sep 10 '17 at 20:13
FYI, `#coding` specifies the encoding of the source file itself, and is used so Python can correctly generate the Unicode codepoints for Unicode string constants. It doesn't change how the code works otherwise. — Mark Tolonen, Sep 10 '17 at 20:20
Well the problem is, I am trying to code for a particular effect and the idea requires me to do it this way for the effect to work. So #coding only affects functions that use the encoding parameter. Okay, so I need to rewrite it a different way. — Infinity Loop, Sep 10 '17 at 20:20
Maybe you should explain what you are really trying to do instead. — Mark Tolonen, Sep 10 '17 at 20:21
Try typing `bytes(range(256))` and `bytes(range(256)).decode('l1')`. — Josh Lee, Sep 10 '17 at 20:22
@Josh Lee Of course! Good idea, that helps to explain the differences. But I still don't understand why it doesn't work above 256. — Infinity Loop, Sep 10 '17 at 20:26
`#coding` doesn't affect functions using encoding at all. It informs Python of the encoding of the source file so it can generate the correct Unicode codepoints for Unicode strings. See [this answer](https://stackoverflow.com/a/3170647/235698) for some examples. — Mark Tolonen, Sep 10 '17 at 20:33
@ Mark Tolonen So I think I can do a workaround to make it editable as bytes instead of converting the read bytes to strings and back to bytes to be written. You are right. I should just keep it as binary data to get rid of all this encoding mess. I will create a list of characters as bytes. — Infinity Loop, Sep 10 '17 at 20:34
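
As a rough sketch of the approach settled on in the comments above, the data can be edited without ever decoding it. The sample header bytes and the replacement table below are made-up illustrations, not taken from the asker's file:

```python
# Hypothetical sample: bytes like those at the start of a JPEG/JFIF file.
data = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'

# Option 1: replace a byte sequence directly on the bytes object.
edited = data.replace(b'JFIF', b'jfif')

# Option 2: map single byte values through a 256-entry table.
# Here byte 0xFF is swapped for 0x00 and every other value is unchanged.
table = bytearray(range(256))
table[0xFF] = 0x00
translated = data.translate(bytes(table))

# Option 3: a bytearray is mutable, so single positions can be patched.
buf = bytearray(data)
buf[0] = 0x00

print(edited)
print(translated)
print(bytes(buf))
```

None of these steps involves an encoding, so every byte value from 0 to 255 survives unchanged unless it is edited on purpose.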

Hatatister · Answer 1 · 2017-09-10T18:17:19.460

0

I think the problem is that a JPEG header stores values that can be any byte value (for example pixel density, marker lengths and so on).

https://en.wikipedia.org/wiki/JPEG_File_Interchange_Format

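In Latin-1 every character is one byte, but the ISO/IEC 8859-1 standard does not assign a character to every value in 0-255 (the control ranges 0x00-0x1F and 0x7F-0x9F are left undefined). Python's latin-1 codec nevertheless maps every byte value to the code point with the same number, so decoding with it never fails.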

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

However, UTF-8 is a multibyte encoding. For code points above 127, the first byte has to start with the bits 110 (for two-byte characters), 1110 (for three-byte characters) or 11110 (for four-byte characters). The second, third and fourth bytes have to start with 10.

https://en.wikipedia.org/wiki/UTF-8
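
The bit patterns can be inspected directly. This is a small sketch, where `utf8_bits` is a helper name made up for this illustration and the four sample characters are arbitrary picks of one, two, three and four bytes:

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, '08b') for b in ch.encode('utf-8')]

# Characters are written as escapes so the source encoding does not matter.
for ch in ['A', '\u00ff', '\u2126', '\U0001f600']:
    print('U+%04X' % ord(ch), utf8_bits(ch))
```

Unlike applying `ord()` to a character, which gives the code point number, this looks at the bytes that `encode('utf-8')` actually produces.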

So the probability of hitting invalid bytes or byte sequences is high if you read arbitrary bytes, which is probably what happens when you read a JPEG header. The same bytes can therefore happen to be valid for Latin-1 but not for UTF-8.
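
A minimal sketch of that difference, assuming a made-up byte string that resembles the start of a JPEG/JFIF file (the variable names are only for illustration):

```python
# Hypothetical sample: bytes like those at the start of a JPEG/JFIF file.
raw = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'

# Latin-1 maps each byte 0-255 to the code point with the same number,
# so decoding and re-encoding returns exactly the original bytes.
latin1_text = raw.decode('latin-1')
latin1_round_trip = latin1_text.encode('latin-1')

# Strict UTF-8 decoding rejects the data because 0xFF is never valid UTF-8.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print('strict decode failed:', exc.reason)

# With 'replace', undecodable bytes become U+FFFD, and the original
# byte values cannot be recovered when encoding again.
utf8_text = raw.decode('utf-8', 'replace')
utf8_round_trip = utf8_text.encode('utf-8')

print(latin1_round_trip == raw)
print(utf8_round_trip == raw)
print(len(raw), len(utf8_round_trip))
```

The Latin-1 round trip is lossless only because it is a one-to-one byte mapping, not because the file is really Latin-1 text.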

edited Sep 10 '17 at 18:17

answered Sep 10 '17 at 17:42


So if I read a jpeg file as bytes, then I print() the file, it shows the first character as bytes and the rest as the correct characters. So Python must know something. In order to assign the correct bytes to the correct characters, I would need a method to segregate the correct length of bits to its particular character or what? – Infinity Loop Sep 10 '17 at 18:08
The first 128 code points (0x00 to 0x7F) are identical to ASCII. A byte with a higher value (> 0x7F) is part of a character that is encoded with multiple bytes, and certain rules apply to the bits those bytes must start with (link added in answer). Therefore bytes > 127 do not mean the same thing as in Latin-1, and if a byte sequence does not follow the UTF-8 standard, decoding raises an exception or, if you use the "replace" error handler (as in your case), the bad bytes are replaced by the replacement character U+FFFD, which is often displayed as a question mark. – Hatatister Sep 10 '17 at 18:25
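
On the question of why printed bytes look partly like characters: a bytes object is a sequence of integers from 0 to 255, and only its printed form uses ASCII letters where a byte happens to be printable. A short sketch, with sample values chosen for illustration:

```python
# OHM SIGN written as an escape so the exact code point is unambiguous.
encoded = '\u2126'.encode('utf-8')

print(encoded)        # shown with \x escapes because no byte is printable ASCII
print(list(encoded))  # the same three bytes as plain integers

# Bytes that fall in the printable ASCII range are displayed as letters,
# but they are still just numbers underneath.
mixed = b'\xff\xd8JFIF'
print(mixed)
print(list(mixed))
print(mixed[2], chr(mixed[2]))
```

Indexing a bytes object returns an int, which is why no decoding step is needed to compare or change individual byte values.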
