
I'm parsing a document that has some UTF-16 encoded strings.

I have a byte string that contains the following:

my_var = b'\xc3\xbe\xc3\xbf\x004\x004\x000\x003\x006\x006\x000\x006\x00-\x001\x000\x000\x003\x008\x000\x006\x002\x002\x008\x005'

When converting to utf-8, I get the following output:

print(my_var.decode('utf-8'))
#> þÿ44036606-10038062285

The first two chars þÿ indicate it's the BOM for UTF-16BE, as listed on Wikipedia.

But what I don't understand is that if I test for the UTF-16BE BOM like this:

if my_var.startswith(codecs.BOM_UTF16_BE):

This returns False. In fact, printing codecs.BOM_UTF16_BE doesn't show the same bytes at all:

print(codecs.BOM_UTF16_BE)
#> b'\xfe\xff'
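For comparison, printing the first two bytes of my_var shows they don't match the BOM constant at all:

print(my_var[:2])
#> b'\xc3\xbe'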

Why is that? I suspect some encoding issue upstream, but I'm not sure how to fix it.

There are already a few questions about how to decode UTF-16 on Stack Overflow (like this one), and they all say one thing: decode using utf-16 and Python will handle the BOM.

... But that doesn't work for me.

print(my_var.decode('utf-16'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀

But with UTF-16BE:

print(my_var.decode('utf-16be'))
#> 쎾쎿44036606-10038062285

(the BOM is not removed)

And with UTF-16LE:

print(my_var.decode('utf-16le'))
#> 뻃뿃㐀㐀 ㌀㘀㘀 㘀ⴀ㄀  ㌀㠀 㘀㈀㈀㠀㔀

So, for a reason I can't explain, using only .decode('UTF-16') doesn't work for me. Why?
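For what it's worth, a byte string that really starts with the raw BOM bytes decodes fine, so the decode call itself seems OK:

print(b'\xfe\xff\x004\x004'.decode('utf-16'))
#> 44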

UPDATE

The original source string isn't the one I mentioned, but this one:

source = rb'\376\377\0004\0004\0000\0003\0006\0006\0000\0006\000-\0001\0000\0000\0003\0008\0000\0006\0002\0002\0008\0005'

I converted it using the following:

import re

def decode_8bit(match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)

Maybe I did something wrong here?

Cyril N.
  • The UTF-16 BOM is 0xFE 0xFF. Your input has something else. Probably related https://stackoverflow.com/questions/11546351/what-character-encoding-is-c3-82-c2-bf – Tomalak Sep 21 '18 at 07:05
  • The binary sequence you provided is not valid UTF-16. Checking the result of `print(...)` is not a valid way of checking encodings since `print` may not print some characters, so you shouldn't trust it. – Giacomo Alzetta Sep 21 '18 at 07:15
  • @Tomalak, I've updated my question (at the end). I forgot to mention the original source, maybe it changes everything? – Cyril N. Sep 21 '18 at 07:30

2 Answers


It is right that þÿ would indicate the BOM for UTF-16BE, but only if you read the bytes with the CP1252 encoding.

The difference is the following:

Your first byte is 0xC3, which is 11000011 in binary.

  • UTF-8:

In UTF-8, the leading bits 110 indicate a two-byte sequence, so your first character is made of 0xC3 0xBE, which decodes to þ.

  • CP1252:

CP1252 characters are always 1 byte long, and 0xC3 maps to Ã.

But if you look up 0xC3 in the BOM list you linked, you won't find a matching encoding. It looks like there wasn't a BOM in the first place.
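You can see the different interpretations side by side (a quick check with the bytes from the question):

print(b'\xc3\xbe'.decode('utf-8'))
#> þ
print(b'\xc3\xbe'.decode('cp1252'))
#> Ã¾
print(b'\xfe\xff'.decode('cp1252'))
#> þÿ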

Using the default encoding is probably the way to go, which is UTF-16LE for Windows.

Edit after original source added

Your re-encoding to UTF-8 destroys the BOM, because the raw BOM bytes are not valid UTF-8 on their own and each one gets expanded into a two-byte sequence. Try to avoid decoding and pass the raw bytes along instead.

OP's solution:

bytes([int(value, base=8)])
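To illustrate what "destroys the BOM" means, here is the first BOM byte, \376 (0xFE): chr() plus encode('utf-8') expands it into two bytes, while bytes([...]) keeps the single raw byte:

print(chr(int(b'376', base=8)))
#> þ
print(chr(0o376).encode('utf-8'))
#> b'\xc3\xbe'
print(bytes([0o376]))
#> b'\xfe'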
Hyarus
  • Thank you. I've added one more detail to my question, explaining how I got the `my_var` value, maybe I did something wrong in there first? – Cyril N. Sep 21 '18 at 07:30
  • @CyrilN. \376\377 is your BOM in base 8. Your encoding to UTF-8 probably destroys it because it is not valid UTF-8. Try to avoid decoding and pass on the raw bytes, or use a single-byte encoding if there is no other way. A Python expert might be more helpful with that. – Hyarus Sep 21 '18 at 07:44
  • Oh my, replacing `chr(int(value, base=8)).encode('utf-8')` by `bytes([int(value, base=8)])` did the trick! – Cyril N. Sep 21 '18 at 08:20
  • @CyrilN. Can you write that as an answer, along with some reasoning about what happened? I think this might be beneficial, since at least one other person here had the "c3 82 c2 bf" byte sequence and the thread over there was somewhat inconclusive. – Tomalak Sep 21 '18 at 08:22
  • @CyrilN. I added your solution to my answer. Let me know if you want to provide your own answer; in that case I will remove it ASAP. I would encourage you to do so if you know why it happened. I tried to look into the chr() and encode() functions but couldn't explain why 0xFF 0xFE was converted to 0xC3 0x82 0xC2 0xBF. – Hyarus Sep 21 '18 at 09:11
  • I've answered it, but you can keep the accepted answer @Hyarus, you helped me go in the right direction. – Cyril N. Sep 21 '18 at 13:41

As requested by @Tomalak and @Hyarus, here's the reason for my issue:

When decoding the 8-bit (octal) values, I was returning them UTF-8 encoded:

def decode_8bit(match):
    value = match.group().replace(b'\\', b'')
    return chr(int(value, base=8)).encode('utf-8')

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)

This was mangling the returned data, since it was not UTF-8 encoded in the first place (duh). So the correct code should have been:

def decode_8bit(match):
    value = match.group().replace(b'\\', b'')
    return bytes([int(value, base=8)])

my_var = re.sub(b'\\\\[0-9]{1,3}', decode_8bit, source)
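With that change the BOM survives, and decode('utf-16') handles it as expected (a quick check, assuming the same source bytes as in my question):

print(my_var[:2])
#> b'\xfe\xff'
print(my_var.decode('utf-16'))
#> 44036606-10038062285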

Hope that helps someone else... Good luck with encoding! :/

Cyril N.
  • Do you know the exact cause? Does chr() not return 255? Or does encode() enforce UTF-8 at all costs and switch some bits? – Hyarus Sep 21 '18 at 13:50
  • Unfortunately no, I don't have a clue. I ran my code on multiple and various sources and everything worked, so it's good. But I don't know what caused the issue. – Cyril N. Sep 21 '18 at 13:57