Decoding a bytes sequence - what's the train of thought when doing it

Question

I'm a complete beginner in Python and in encodings, and I have to decode this byte sequence:

enc = b'\x80\x03}q\x00(K\x01K\x01K\x02K\x03K\x03K\x06K\x04G?\xc5UUUUUUK\x05G?\xe0\x00\x00\x00\x00\x00\x00K\x06G?\x9cq\xc7\x1cq\xc7\x1cK\x07G?\xc5UUUUUUK\x08K$K\tG?\xb5UUUUUUK\nK\x07K\x0bG?\xe5UUUUUUK\x0cG?\xb5UUUUUUK\rG?\xedUUUUUUK\x0eK4K\x0fG?\xb3\xb1;\x13\xb1;\x14K\x10K\x00K\x11G?\xcd\x89\xd8\x9d\x89\xd8\x9eK\x12G?\xcb\x9b\x9b\x9b\x9b\x9b\x9cK\x13G?\xa4\x14\x14\x14\x14\x14\x14K\x14X\x08\x00\x00\x00discretaq\x01K\x15K\x02K\x16X\x02\x00\x00\x00daq\x02K\x17G?\xe4z\xe1G\xae\x14{K\x18G@\x15\x00\x00\x00\x00\x00\x00K\x19G?\xe4z\xe1G\xae\x14|K\x1aK2K\x1bK\x01K\x1cK\x03K\x1dG?\xd5UUUUUUK\x1eG?\xc5UUUUUUK\x1fK\x01K K\x04K!G?\xaf\xf2\xe4\x8e\x8aq\xdeK"K\x04K#X\x04\x00\x00\x00mareq\x03u.'

I tried doing it this way

strputere = enc.decode()

print(strputere)

and I get an error

File "encode.py", line 4, in <module>
    strputere = enc.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I started doing a bit of research, and I found that b stands for bytes.

So my enc variable is a bytes literal. I looked into .decode() and it seemed like a good choice, but it might not be.

I'm a bit confused because it is a bytes literal, but it contains some characters (such as \x80) that I think are UTF-8 characters.

So, how can I decode this, and what would the algorithm be? I would love to understand what happens. I did my research, but I'm a bit lost and need some help.

A byte sequence is ambiguous since the same sequence of bytes can represent different data in different situations. What is the context? What does "decode" mean for you here? — John Coleman, Oct 14 '21 at 11:24
@JohnColeman These are the answers to some questions, which are represented by real numbers (1, 1/6, 3, 99.2, etc.). — , Oct 14 '21 at 11:26
Just a note to add to my answer here: I ran chardet on your data, and it is almost certainly not a standard string encoding. Most likely it's a group of data fields, which means that unless you know the exact structure of the data it is quite difficult to work out. — Zaid Al Shattle, Oct 14 '21 at 11:28
The number of bytes in `enc` is not a multiple of 64, so it doesn't split nicely into a sequence of floats. — John Coleman, Oct 14 '21 at 11:29
The question that may help us help you is: what technology was used to _encode_ these numbers (1, 1/6, 3, ...) into a byte array? Do you know what the decoded sequence should be? — xtofl, Oct 14 '21 at 11:29
"I have this sequence and I have to decode it" Ok, well, *how is it encoded*? The important thing to understand is that *bytes are just bytes*. If they have any particular meaning, that is **up to whoever created those bytes**. They may have intended them to encode text, in which case you need to know the text encoding. Just because bytes *can* be decoded using some text encoding doesn't mean that you get what was *intended*. — juanpa.arrivillaga, Oct 14 '21 at 11:56

score 3 · Answer 1 · answered Oct 14 '21 at 11:21

So, generally when you have a byte sequence you have two different ways to approach it, depending on the contents:

Is it a pure string sequence?

If dealing with a pure string sequence, you need to decode using the following:

enc.decode("utf-8")

Keep in mind that in this case you must know which encoding was used (here utf-8). Judging by the error message you got, though, utf-8 is not the right choice for your data.

If you don't know the encoding but you know it's definitely a string encoding, you can take a look at the options mentioned in this question here
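As a minimal sketch of why the encoding matters (the sample text is made up for illustration), the same bytes decode cleanly with the right codec, turn into garbage with the wrong one, and raise the error from the question when they are not valid UTF-8 at all:

```python
# Hypothetical sample text, chosen only because it contains non-ASCII characters.
text = "naïve café"
raw = text.encode("utf-8")

# Decoding with the codec that produced the bytes round-trips exactly.
round_trip = raw.decode("utf-8")
print(round_trip)

# Decoding with a different codec may "work" but yields the wrong characters.
garbled = raw.decode("latin-1")
print(garbled)

# The first two bytes from the question are not valid UTF-8 text.
try:
    b"\x80\x03".decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print("cannot decode:", reason)
```

A decode call that succeeds therefore proves little on its own; it only shows that the bytes happen to be legal in that codec.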

Sensor/Other input

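If the bytes come from an embedded device, or from any other source that packs several fields into one sequence rather than a single string, you can use struct.unpack(). This is a bit more complicated, and you will need to go through the docs to find the exact format string for your data.

The way it works is that you tell Python what each group of bytes is (string, int, etc.) and how long it is, and it converts the bytes into a tuple of objects. The format string below is only an illustration; it has to match the real layout and total length of your data, otherwise struct.unpack raises struct.error:

import struct

# Illustrative format: big-endian mix of 1-, 2- and 4-byte integers
values = list(struct.unpack('>BBHBBhBHhHL', enc))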
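To make the struct approach concrete, here is a small sketch with a made-up record layout (the field names and the `>BHf` format are assumptions for illustration, not the layout of the data in the question). It also shows that the buffer length has to match the format exactly:

```python
import struct

# Hypothetical 7-byte record: 1-byte sensor id, 2-byte unsigned counter,
# 4-byte float reading, all big-endian ('>' also disables padding).
fmt = ">BHf"
payload = struct.pack(fmt, 7, 513, 1.5)

sensor_id, counter, reading = struct.unpack(fmt, payload)
print(sensor_id, counter, reading)

# The buffer length must equal the size implied by the format string.
fmt_size = struct.calcsize(fmt)
print(fmt_size, len(payload))

# Size of the illustrative format string used in the answer above.
answer_fmt_size = struct.calcsize(">BBHBBhBHhHL")
print(answer_fmt_size)

# One extra byte is enough to make unpack fail.
try:
    struct.unpack(fmt, payload + b"\x00")
    mismatch_detected = False
except struct.error as exc:
    mismatch_detected = True
    print("size mismatch:", exc)
```

Since the format string from the answer describes only 19 bytes and the sequence in the question is far longer, that exact call would fail on `enc`; the format always has to be derived from knowledge of how the data was produced.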

score 2 · Accepted Answer · answered Oct 14 '21 at 11:37

This data was encoded with Python's pickle module. You can decode it like this:

>>> import pickle
>>> numbers = pickle.loads(enc)
>>> print(numbers)
{1: 1, 2: 3, 3: 6, 4: 0.16666666666666666, 5: 0.5, 6: 0.027777777777777776, 7: 0.16666666666666666, 8: 36, 9: 0.08333333333333333, 10: 7, 11: 0.6666666666666666, 12: 0.08333333333333333, 13: 0.9166666666666666, 14: 52, 15: 0.07692307692307693, ...
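Before loading an unknown pickle it can help to look inside it first. The sketch below uses a shortened sample assembled from opcodes that are visible in the question's bytes (it is an illustration, not the full `enc` value) and disassembles it with the standard `pickletools` module, which does not execute anything:

```python
import pickle
import pickletools

# Shortened sample built from byte patterns visible in the question:
# header, an empty dict, the pairs 1 -> 1 and 4 -> 0.5, then the end marker.
sample = b'\x80\x03}q\x00(K\x01K\x01K\x04G?\xe0\x00\x00\x00\x00\x00\x00u.'

# Print the opcode listing (PROTO, EMPTY_DICT, BININT1, BINFLOAT, ...).
pickletools.dis(sample)

# Only unpickle data that comes from a source you trust:
# pickle.loads can run arbitrary code embedded in the data.
decoded = pickle.loads(sample)
print(decoded)  # {1: 1, 4: 0.5}
```

The same two calls work on the full `enc` value from the question.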

xtofl

Great find! May I ask how you recognized it's a pickle file? Was it a certain format/character in the bytes? – Zaid Al Shattle Oct 14 '21 at 11:41
A hunch, triggered by there being no apparent character format, the "the answers for some questions" comment by OP, strengthened by the `\x80` recurring in https://docs.python.org/3/library/pickletools.html. In other words, sheer luck. – xtofl Oct 14 '21 at 13:12
@xtofl I wouldn't say sheer luck. It was an educated guess rather than a completely random guess. – John Coleman Oct 14 '21 at 14:17
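The educated guess discussed in these comments can also be checked by hand. As a rough sketch (the chunks below are copied from the question's bytes; the variable names are only illustrative), pickles written with protocol 2 or later start with the PROTO opcode `\x80` followed by the protocol number, `G` introduces an 8-byte big-endian double, and `X` introduces a UTF-8 string with a 4-byte little-endian length prefix:

```python
import struct

# First two bytes of the data: PROTO opcode (0x80) and the protocol version.
header = b'\x80\x03'
looks_like_pickle = header[0] == 0x80
protocol = header[1]
print(looks_like_pickle, protocol)

# 'G' is followed by an 8-byte big-endian IEEE 754 double.
float_chunk = b'G?\xc5UUUUUU'
(value,) = struct.unpack('>d', float_chunk[1:])
print(value)

# 'X' is followed by a 4-byte little-endian length and that many UTF-8 bytes.
text_chunk = b'X\x04\x00\x00\x00mare'
(length,) = struct.unpack('<I', text_chunk[1:5])
text = text_chunk[5:5 + length].decode('utf-8')
print(length, text)
```

This explains why the float values in the decoded dictionary appear in the raw bytes as `G?` followed by eight more bytes, and why the final byte is `.`, the STOP opcode.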

score 1 · Answer 3 · answered Oct 14 '21 at 11:30

The error happens because the byte string contains bytes that are not valid UTF-8. The very first byte, 0x80, can never start a UTF-8 sequence, which is what "invalid start byte" refers to.

Is it just random data, or was it produced with some particular encoding? Decoding with "unicode_escape" does not raise an error, but the output does not appear to be very useful.

enc.decode("unicode_escape")

returns:

'\x80\x03}q\x00(K\x01K\x01K\x02K\x03K\x03K\x06K\x04G?ÅUUUUUUK\x05G?à\x00\x00\x00\x00\x00\x00K\x06G?\x9cqÇ\x1cqÇ\x1cK\x07G?ÅUUUUUUK\x08K$K\tG?µUUUUUUK\nK\x07K\x0bG?åUUUUUUK\x0cG?µUUUUUUK\rG?íUUUUUUK\x0eK4K\x0fG?³±;\x13±;\x14K\x10K\x00K\x11G?Í\x89Ø\x9d\x89Ø\x9eK\x12G?Ë\x9b\x9b\x9b\x9b\x9b\x9cK\x13G?¤\x14\x14\x14\x14\x14\x14K\x14X\x08\x00\x00\x00discretaq\x01K\x15K\x02K\x16X\x02\x00\x00\x00daq\x02K\x17G?äzáG®\x14{K\x18G@\x15\x00\x00\x00\x00\x00\x00K\x19G?äzáG®\x14|K\x1aK2K\x1bK\x01K\x1cK\x03K\x1dG?ÕUUUUUUK\x1eG?ÅUUUUUUK\x1fK\x01K K\x04K!G?¯òä\x8e\x8aqÞK"K\x04K#X\x04\x00\x00\x00mareq\x03u.'
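A short check, using a fragment of the question's bytes, suggests why this output looks almost identical to the input: for data that contains no backslash bytes, "unicode_escape" maps every byte to the character with the same code point, exactly as "latin-1" does. Nothing is interpreted as a number and nothing is dropped:

```python
# Fragment copied from the start of the question's data (no backslash bytes in it).
sample = b'\x80\x03G?\xc5UUUUUU'

as_text = sample.decode("unicode_escape")
as_latin1 = sample.decode("latin-1")

# Both codecs give the same string for this input.
same_result = as_text == as_latin1
print(same_result)

# Each character's code point equals the original byte value.
code_points = [ord(ch) for ch in as_text]
print(code_points == list(sample))

# The result is a str of the same length, not the numbers the bytes stand for.
print(type(as_text).__name__, len(as_text), len(sample))
```

So the call changes the type from `bytes` to `str` without recovering the structure of the data.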

Fredrik Reinholdsen

This output seems to be exactly the same as the input (i.e. it appears it doesn't really change it much?) – Zaid Al Shattle Oct 14 '21 at 11:35
@ZaidAlShattle it changes it *entirely*, it is a *completely different type of object*. It was `bytes`, now it is a `str` – juanpa.arrivillaga Oct 14 '21 at 11:50
Well, yes, but in reality it seems almost the same as using `str(enc)`, which would create a string, but it wouldn't *decode* it. At least that's what I see from taking a quick look at the output @juanpa.arrivillaga – Zaid Al Shattle Oct 14 '21 at 11:52
I think what happens is that it simply skips decoding characters it does not know what to do with, which in this case appears to be most of them. Normally when you're decoding something, if the info is to be at all useful you need to know how the information was encoded. If you just want to decode it to ints, you can do so simply using struct.unpack, as others mentioned. – Fredrik Reinholdsen Oct 14 '21 at 12:37
Decoding a byte string is a bit like translating from one language to another. If you don't know the original language, then the output is not that useful. And even if you find certain words that exist in both languages, they might not mean the same thing. What this command does is essentially try to translate from utf-8, but if it can't find the word in the dictionary it simply skips it. – Fredrik Reinholdsen Oct 14 '21 at 12:47
