
I have two versions of a PDF, and I know they're slightly different in the "Reassessment" text in the gray bar on page 3:

[screenshot: online PDF diff]

I'm trying to get the textual difference on my machine.

I used pdfcpu to extract the content from the multi-page PDF and then ran page 3 through the diff utility:

% diff out_orig/page_3.txt out_new/page_3.txt 

1650a1651,1658
> BT
> 1 0 0 rg
> 0 i 
> /RelativeColorimetric ri
> /C2_2 9.96 Tf
> 0 Tw 358.147 648.779 Td
> <0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056>Tj
> ET

I've looked up 7.3.4.3 Hexadecimal String in the PDF reference:

A hexadecimal string shall be written as a sequence of hexadecimal digits encoded as ASCII characters and enclosed within angle brackets.

and so I thought I should be able to do something as simple as interpreting the hex characters directly as ASCII text:

>>> s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
>>> import binascii
>>> binascii.a2b_hex(s)
b'\x005\x00H\x00D\x00V\x00V\x00H\x00V\x00V\x00P\x00H\x00Q\x00W\x00\x03\x000\x00X\x00V\x00W\x00\x03\x002\x00F\x00F\x00X\x00U\x00\x03\x00(\x00Y\x00H\x00U\x00\\\x00\x03\x00\x16\x00\x03\x000\x00R\x00Q\x00W\x00K\x00V'

but I'm getting garbage. Even without the null bytes:

>>> binascii.a2b_hex(s).replace(b'\x00', b'')
b'5HDVVHVVPHQW\x030XVW\x032FFXU\x03(YHU\\\x03\x16\x030RQWKV'

I expect it to look something like this (in reverse):

>>> binascii.b2a_hex(b'Reassessment Must Occur Every 3 Months')
b'52656173736573736d656e74204d757374204f636375722045766572792033204d6f6e746873'

I found this comment on this somewhat-related SO post:

Literal string (7.3.4.2) - this is pretty much straight-forward, as you just walk the data for "(.?)"

and its reply:

That's only true for simple examples using standard font encoding. Meanwhile custom encodings for embedded fonts have become very common.

So... maybe that hex string isn't just hex-encoded ASCII?
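For the record, decoding the raw bytes as UTF-16BE (a guess suggested by the `00` high bytes) gives the same shifted letters, so it isn't standard UTF-16 text either:

```python
import binascii

s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
raw = binascii.a2b_hex(s)
# decodes cleanly, but the result is still shifted: 'R' comes out as '5',
# and the spaces come out as \x03 control characters
print(raw.decode('utf-16-be'))
```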

What am I missing in trying to extract the textual difference?

Zach Young
  • It's strange to me that the hex string contains no "abcdef" bytes. It's possible, but very improbable for a string this long. I think that this is not a hex-string. – Michael Ruth Sep 09 '21 at 22:48
  • @MichaelRuth, That was my first impression, but there is a `B` in the last 3 bytes, `4B0056`. It's all the `00` null bytes that I don't understand. – Zach Young Sep 09 '21 at 22:49
  • Oh wow, I missed that byte – Michael Ruth Sep 09 '21 at 22:55
  • It's UTF-16. Two bytes per character. – Mark Reed Sep 09 '21 at 23:03
  • Oh, it's not UTF-16. I think it's maybe a custom encoding, would need to see the original PDF to find that info though. – wim Sep 09 '21 at 23:58
  • @wim: thanks for commenting on that. Any resources you can point me to that would help me dig that up, or even understand the problem space better? Also, where did that `+29` offset in your solution come from? Did you just *see* the offset yourself? – Zach Young Sep 10 '21 at 00:41
  • The hex string is not a text string as you expect; the double-byte hex codes are glyph indices in the font's glyf table. The /C2_2 font object in the PDF should include a ToUnicode CMap object that maps glyph indices to actual characters. Usually font generators place the glyphs in the glyf table in the same order as the characters (at least for the ones in the English alphabet), so if you can guess a font-specific offset, like the 29 here, you can do a "brute force" mapping for some characters. – iPDFdev Sep 10 '21 at 07:23
  • @iPDFdev describes what most likely is the case in the PDF in question. In general the situation may be more complicated. – mkl Sep 10 '21 at 08:53

2 Answers


Here we go:

>>> s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
>>> def chunks(seq, n):
...     """Yield successive n-sized chunks from seq."""
...     for i in range(0, len(seq), n):
...         yield seq[i:i+n]
...
>>> ns = [29 + int(c, 16) for c in chunks(s, 4)]
>>> print(bytes(ns))
b'Reassessment Must Occur Every 3 Months'

chunks splits the string into four-hex-digit groups, one per two-byte code.

wim
  • That works, but where does the 29-code offset come from? – Mark Reed Sep 09 '21 at 23:06
  • You could also just do `''.join([chr(29+int(s[i:i+4],16)) for i in range(0,len(s),4)])` and not need the `chunks` definition. – Mark Reed Sep 09 '21 at 23:12
  • This works sometimes. Sometimes not. The encoding depends on the PDF font in question. And that PDF font may have pretty arbitrary encodings. – mkl Sep 10 '21 at 08:29

No, it is not ASCII. ASCII is a 7-bit encoding, one byte per character, while these character codes are two bytes each.

Multibyte character codes are used by PDF composite fonts, and specify the glyph to be drawn by its index in the font's glyph table. Essentially there is no character map; there is only a reverse mapping (the ToUnicode CMap) from these glyph indices to Unicode, to make text extraction and searching possible.
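As a sketch of what that means in practice (using the hex string from the question), the string splits into two-byte codes that are glyph indices, not character codes:

```python
s = '0035004800440056005600480056005600500048005100570003003000580056005700030032004600460058005500030028005900480055005C0003001600030030005200510057004B0056'
# composite-font strings are sequences of two-byte character codes,
# so take four hex digits at a time
codes = [int(s[i:i+4], 16) for i in range(0, len(s), 4)]
print([hex(c) for c in codes[:4]])  # ['0x35', '0x48', '0x44', '0x56']
```

Mapping these indices back to text requires the font's ToUnicode CMap, or, as in the other answer, a guessed font-specific offset.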

The common OpenType font format reserves glyph index 0 for .notdef, and by long-standing convention glyph 1 = .null, 2 = CR and 3 = space (ASCII code 32). Note that 32 - 3 = 29.

So an OpenType composite font built for the ASCII character set, omitting the non-printing characters 0 to 31, will have the property:

Glyph index + 29 = ASCII
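A quick check of that arithmetic against the string from the question (a sketch; it assumes the font really does follow the conventional glyph ordering):

```python
# space is ASCII 32 and, by convention, glyph index 3: offset = 32 - 3 = 29
assert ord(' ') - 3 == 29
# the first code in the hex string is 0x0035; 0x35 + 29 = 82 = 'R'
assert chr(0x35 + 29) == 'R'
# the lone 0x0016 code maps to the digit '3' (0x16 + 29 = 51)
assert chr(0x16 + 29) == '3'
```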

Ian Pedley