
For example, I have the literal string,

00180012001a001b...

And I need to cast this literal string of characters to bytes, figure out what the right encoding is, and decode them.

There is no right answer here when it comes to the decoding format; it is something I'll have to figure out. The important thing is to start from what Python has to offer out of the box.

Anyhow, there seems to be a natural partition of the characters into 16-bit units, four hex characters each:

[ 'x0018', 'x0012', 'x0014', ... ] <- '...' 

I'd like to do this in some reasonably straightforward and robust way. How does Python 3 handle this situation?
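
One way to sketch that partition, assuming each group of four hex characters is a single 16-bit unit. Only the visible prefix of the sample string is used here, and the helper name `split_units` is hypothetical:

```python
def split_units(hex_text: str, width: int = 4) -> list[int]:
    """Split a hex string into fixed-width groups and parse each as an integer."""
    if len(hex_text) % width:
        raise ValueError(f"length {len(hex_text)} is not a multiple of {width}")
    return [int(hex_text[i:i + width], 16) for i in range(0, len(hex_text), width)]


sample = "00180012001a001b"  # visible prefix of the string above
units = split_units(sample)
print([f"{u:#06x}" for u in units])
```

The length check is there so that a truncated string fails loudly instead of silently yielding a short final group.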


For context, I am pulling content from PDFs. The PDF mixes encodings and has an overall encoding of "ascii".

I start with a string as follows:

# original string; I've anonymized the actual bytes.
# the string is broken up with info related to PDFs

original = '[ <00010002> 1 <000300040005> 1.00708 <00060007> ] TJ'

parsed = '0001000200030004000500060007'

# obviously this won't work: it encodes the hex digits themselves as ASCII text
import chardet
bparsed = parsed.encode('ascii')
bparsed_encoding = chardet.detect(bparsed) # ascii

# what I want to see output is: WinAnsiEncoding
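
For reference, a minimal sketch of how `parsed` might be derived from `original`, assuming every hex string sits between `<` and `>` and contains only hex digits. Real PDF hex strings may also contain whitespace or an odd number of digits, which this sketch does not handle. The name `hex_chunks` is made up for illustration:

```python
import re

original = '[ <00010002> 1 <000300040005> 1.00708 <00060007> ] TJ'

# Collect the hex digits between each pair of angle brackets, in order.
hex_chunks = re.findall(r'<([0-9A-Fa-f]+)>', original)
parsed = ''.join(hex_chunks)
print(parsed)
```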

So the question is: how do I cast this string to its literal byte state without encoding it as an ascii-encoded string?


I've included more than enough information to answer the question. There is no need to dive into more context.

Chris
  • How is said literal string encoded? hex? decimal? something else? – SuperStormer Jun 30 '22 at 19:13
  • You really want literal `x` bytes in the result? – Barmar Jun 30 '22 at 19:14
  • The bytes 0x18, 0x12, 0x14 basically occur in any and all encodings. There’s no way to differentiate ASCII from whatever else given this sample. – deceze Jun 30 '22 at 19:14
  • @SuperStormer ascii is original -- working on actual example – Chris Jun 30 '22 at 19:14
  • What do you mean by `b'x0018'`? Maybe `b'\x18'`? – wjandrea Jun 30 '22 at 19:14
  • And “WinAnsiEncoding” basically doesn’t mean anything either, that could still be any one of several dozen encodings. – deceze Jun 30 '22 at 19:15
  • See https://stackoverflow.com/questions/5649407/hexadecimal-string-to-byte-array-in-python for a start – Barmar Jun 30 '22 at 19:17
  • @deceze It is something you can read all about here: https://en.wikipedia.org/wiki/Windows-1252. Not sure what "several dozen" you are referring to. It is one of the primary half dozen encoding supported in PDF. – Chris Jun 30 '22 at 19:29
  • @SuperStormer It might not always be WinAnsi. How is your comment relevant to my question? (not rhetorical) – Chris Jun 30 '22 at 19:54
  • Anyways, bytes are represented by 2 hex chars, not 4, so it's unclear how you want to convert the string. – SuperStormer Jun 30 '22 at 19:57
  • @wjandrea In C++ you define a hex as "0x1a". There is no backslash. That might be how debuggers represent the string. – Chris Jun 30 '22 at 20:06
  • @Chris Pardon? What does that have to do with what I said? You wrote `b'x0018'`, not, say, `b'0x0018'`. – wjandrea Jun 30 '22 at 20:08
  • @wjandrea the string I start from represents each hex using 4 characters; this should be clear from the example. To indicate that this string ought to be interpreted as byte chunks, I added an 'x' prefix. Only the last two characters are ever used. That is how I know they are hex chunks. I prefixed it with the phrase `# psuedo code`. There seems to be confusion as to the liberty one can take with language and communication to deliver a concept. In other words, you understood what I meant when you read what I wrote. Therefore it was effective. Now, I have re-written in verbosely. – Chris Jun 30 '22 at 20:11
  • @KJ I happen to know that the binary is WinAnsi from the PDF. I have already parsed the PDF to the point where we have the interior encoding in a single line. There is a mixed encoding in this pdf. I have extracted the binary internals of a single line of WinAnsi. I am constructing a function to automatically detect that encoding, and we'll see how it goes. Anyways, looks like Barmar's comment is the bit of information I was looking for. – Chris Jun 30 '22 at 20:14
  • @wjandrea anyways, the string is a `w_char` (two bytes) and is therefore a `0x0011`-style hex...or, for short `x0011`. I am not sure why you are banging on about the `0x0011` C++ style format. Python doesn't even include the zero. This is a stupid argument. – Chris Jun 30 '22 at 20:23
  • @KJ Makes sense, appreciate the tip. – Chris Jun 30 '22 at 23:06
  • @KJ if you were to hand write a hex number on a piece of paper to represent a 64 bit unicode character, there would be 4 or 8 characters? And for a 32: 2 or 4 hex characters? – Chris Jun 30 '22 at 23:24
  • @KJ or have I halved them again: an 8bit character would be char (ascii), x00, 16 (smallest unicode) x0000, 32 (float) x00000000, and double, x0000000000000000. All of which initialize a literal binary value in lower level languages. But python only prints in bytes, x00, in the debugger, with the x escaped, which introduced this holy war on my question. – Chris Jul 01 '22 at 00:07
  • @KJ The text is probably encrypted. It is encapsulated with "<>" instead of "()" – Chris Jul 01 '22 at 00:41
  • @KJ I am pulling output from pikepdf's initial outlay. It is a pdf/a file. It is either encrypted (rather badly) to prevent anything other than adobe acrobat from editing (or an equally sophisticated parser which, for what I am doing, doesn't exist) or, as you say, some sort of error is present. But, the pdf was put into pdf/a format by the latest ghostscript from a compressed state and it renders OK. It is just resisting conversion to edit mode, and my direct editor is, at the moment, failing to decode the strings. I have tried (in the answer) to strip out the zeros. I'll try... – Chris Jul 01 '22 at 03:39
  • what you recommend first. I can see that I could derive the encoding from known text clearly (it is apparent from the patterns of bytes), but that would be a short term fix. – Chris Jul 01 '22 at 03:41
  • @KJ I have used pymupdf to some extent, but stopped. Pikepdf leverages qpdf. Hopefully, I'll be able to get to the bottom of the encoding problem. – Chris Jul 01 '22 at 04:18

1 Answer


Anyway, although it only helped rule some ideas out, the answer I initially sought is:

byte_array = bytearray.fromhex(line)

# or, in my case, the following yielded more information, 
# although it looks like a modified encoding

# keep only the low byte (last two hex digits) of each 4-digit group
simplified = ''.join(line[i:i+2] for i in range(2, len(line), 4))
byte_array = bytearray.fromhex(simplified)

I didn't have any luck, but the following could help:

import chardet
detected = chardet.detect(byte_array)  # a dict; detected['encoding'] is the guessed codec name

And finally:

byte_array.decode('<your-encoding>')
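
Putting the steps together as a rough sketch. The input `00480069` and both codec names are hypothetical stand-ins, not values taken from the PDF in question; if the two-byte codes are font-specific rather than a standard encoding, neither decode will produce readable text:

```python
# Hypothetical input: two 16-bit codes whose low bytes spell "Hi".
line = "00480069"

raw = bytes.fromhex(line)                # the literal bytes, no text encoding involved
low_bytes = raw[1::2]                    # drop the zero high byte of each pair
as_cp1252 = low_bytes.decode("cp1252")   # interpret the low bytes as Windows-1252
as_utf16 = raw.decode("utf-16-be")       # or treat the pairs as big-endian 16-bit units
print(as_cp1252, as_utf16)
```

Slicing the bytes with `raw[1::2]` does the same job as stripping the zeros from the hex text, and decoding the untouched bytes as `utf-16-be` is a quick check of whether the zeros are simply the high bytes of 16-bit characters.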
Chris