How does one cast a literal string of byte values in ascii that describe bytes in characters to the actual bytes?

Question

For example, I have the literal string,

00180012001a001b...

And I need to cast this literal string of characters to bytes, figure out what the right encoding is, and decode them.

There is no right answer here when it comes to the decoding format; it is something I'll have to figure out. The important thing is to start from what python has to offer out of the box.

Anyhow, there seems to be a natural partition of the characters into 16-bit units (four hex digits each):

[ 'x0018', 'x0012', 'x0014', ... ] <- '...'

I'd like to do this in some reasonably straightforward and robust way. How does Python 3 handle this situation?

For context, I am pulling content from PDFs. The PDF mixes encodings and has an overall encoding of "ascii".

I start with a string as follows:

import chardet

# original string; I've anonymized the actual bytes.
# the string is broken up with info related to PDFs

original = '[ <00010002> 1 <000300040005> 1.00708 <00060007> ] TJ'

parsed = '0001000200030004000500060007'

# obviously this won't work
bparsed = parsed.encode('ascii')
bparsed_encoding = chardet.detect(bparsed) # ascii

# what I want to see output is: WinAnsiEncoding

So the question is: how do I cast this string to its literal byte state without encoding it as an ascii-encoded string?

I've included more than enough information to answer the question I'm looking for the answer to. There is no need to dive into more context.

How is said literal string encoded? hex? decimal? something else? — SuperStormer, Jun 30 '22 at 19:13
The bytes 0x18, 0x12, 0x14 basically occur in any and all encodings. There’s no way to differentiate ASCII from whatever else given this sample. — deceze, Jun 30 '22 at 19:14
@SuperStormer ascii is original -- working on actual example — Chris, Jun 30 '22 at 19:14
And “WinAnsiEncoding” basically doesn’t mean anything either, that could still be any one of several dozen encodings. — deceze, Jun 30 '22 at 19:15
See https://stackoverflow.com/questions/5649407/hexadecimal-string-to-byte-array-in-python for a start — Barmar, Jun 30 '22 at 19:17
@deceze It is something you can read all about here: https://en.wikipedia.org/wiki/Windows-1252. Not sure what "several dozen" you are referring to. It is one of the primary half dozen encoding supported in PDF. — Chris, Jun 30 '22 at 19:29
@SuperStormer It might not always be WinAnsi. How is your comment relevant to my question? (not rhetorical) — Chris, Jun 30 '22 at 19:54
Anyways, bytes are represented by 2 hex chars, not 4, so it's unclear how you want to convert the string. — SuperStormer, Jun 30 '22 at 19:57
@wjandrea In C++ you define a hex as "0x1a". There is no backslash. That might be how debuggers represent the string. — Chris, Jun 30 '22 at 20:06
@Chris Pardon? What does that have to do with what I said? You wrote `b'x0018'`, not, say, `b'0x0018'`. — wjandrea, Jun 30 '22 at 20:08
@wjandrea the string I start from represents each hex using 4 characters; this should be clear from the example. To indicate that this string ought to be interpreted as byte chunks, I added an 'x' prefix. Only the last two characters are ever used. That is how I know they are hex chunks. I prefixed it with the phrase `# psuedo code`. There seems to be confusion as to the liberty one can take with language and communication to deliver a concept. In other words, you understood what I meant when you read what I wrote. Therefore it was effective. Now, I have re-written in verbosely. — Chris, Jun 30 '22 at 20:11
@KJ I happen to know that the binary is WinAnsi from the PDF. I have already parsed the PDF to the point where we have the interior encoding in a single line. There is a mixed encoding in this pdf. I have extracted the binary internals of a single line of WinAnsi. I am constructing a function to automatically detect that encoding, and we'll see how it goes. Anyways, looks like Barmar's comment is the bit of information I was looking for. — Chris, Jun 30 '22 at 20:14
@wjandrea anyways, the string is a `w_char` (two bytes) and is therefore a `0x0011`-style hex...or, for short `x0011`. I am not sure why you are banging on about the `0x0011` C++ style format. Python doesn't even include the zero. This is a stupid argument. — Chris, Jun 30 '22 at 20:23
@KJ if you were to hand write a hex number on a piece of paper to represent a 64 bit unicode character, there would be 4 or 8 characters? And for a 32: 2 or 4 hex characters? — Chris, Jun 30 '22 at 23:24
@KJ or have I halved them again: an 8bit character would be char (ascii), x00, 16 (smallest unicode) x0000, 32 (float) x00000000, and double, x0000000000000000. All of which initialize a literal binary value in lower level languages. But python only prints in bytes, x00, in the debugger, with the x escaped, which introduced this holy war on my question. — Chris, Jul 01 '22 at 00:07
@KJ The text is probably encrypted. It is encapsulated with "<>" instead of "()" — Chris, Jul 01 '22 at 00:41
@KJ I am pulling output from pikepdf's initial outlay. It is a pdf/a file. It is either encrypted (rather badly) to prevent anything other than adobe acrobat from editing (or an equally sophisticated parser which, for what I am doing, doesn't exist) or, as you say, some sort of error is present. But, the pdf was put into pdf/a format by the latest ghostscript from a compressed state and it renders OK. It is just resisting conversion to edit mode, and my direct editor is, at the moment, failing to decode the strings. I have tried (in the answer) to strip out the zeros. I'll try... — Chris, Jul 01 '22 at 03:39
what you recommend first. I can see that I could derive the encoding from known text clearly (it is apparent from the patterns of bytes), but that would be a short term fix. — Chris, Jul 01 '22 at 03:41
@KJ I have used pymupdf to some extent, but stopped. Pikepdf leverages qpdf. Hopefully, I'll be able to get to the bottom of the encoding problem. — Chris, Jul 01 '22 at 04:18

Chris · Accepted Answer · 2022-07-01T11:39:07.410

Anyway, although it only helped rule some ideas out, the answer I initially sought is:

# `line` is the hex string, e.g. '0001000200030004000500060007'
byte_array = bytearray.fromhex(line)

# or, in my case, the following yielded more information, 
# although it looks like a modified encoding

# keep only the low byte (last two hex digits) of each four-digit group
simplified = ''.join(line[i:i+2] for i in range(2, len(line), 4))
byte_array = bytearray.fromhex(simplified)
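
For completeness, here is a minimal sketch of the whole path from the `TJ` line in the question to raw bytes. It is not part of what I originally ran: the helper names `hex_strings` and `to_code_units` are made up for illustration, and the sketch assumes every `<...>` group holds an even number of hex digits that can be read as big-endian units of four digits each.

```python
import re

def hex_strings(tj_line):
    """Return the contents of every <...> hex string in a content-stream line."""
    return re.findall(r"<([0-9A-Fa-f\s]*)>", tj_line)

def to_code_units(hex_text, width=4):
    """Split a hex string into integers of `width` hex digits each (big-endian)."""
    raw = bytes.fromhex(hex_text)   # the actual bytes the hex digits describe
    step = width // 2               # two hex digits per byte
    return [int.from_bytes(raw[i:i + step], "big") for i in range(0, len(raw), step)]

original = '[ <00010002> 1 <000300040005> 1.00708 <00060007> ] TJ'

parsed = "".join(hex_strings(original))          # hex digits only
codes = to_code_units(parsed)                    # one integer per 4-digit group
low_bytes = bytes(c & 0xFF for c in codes)       # same idea as `simplified` above

print(parsed)
print(codes)
print(low_bytes)
```

Working with integers per group rather than slicing the string by hand makes it easier to check whether the high byte is really always zero before throwing it away.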

I didn't have any luck, but the following could help:

import chardet
encoding = chardet.detect(byte_array)  # a dict with 'encoding' and 'confidence' keys

And finally:

byte_array.decode('<your-encoding>')
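
If the encoding is not known up front, one way to narrow it down is to try a few candidates and look at what comes out. The following is only a sketch: the function name `try_decodings` and the candidate list are my own assumptions, not something the PDF tells you. If the bytes turn out to be font-specific glyph codes rather than text in a standard encoding, none of the candidates will give readable output.

```python
def try_decodings(data, candidates=("utf-16-be", "cp1252", "latin-1")):
    """Decode `data` with each candidate codec; None marks a failed decode."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Hypothetical sample: two 16-bit units, 0x0048 and 0x0069
sample = bytes.fromhex("00480069")
for name, text in try_decodings(sample).items():
    print(name, repr(text))
```

A decode that succeeds is not proof that the encoding is right. Single-byte codecs such as `latin-1` accept every byte value, so the output still has to be judged by eye or against known text.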
