Reading PDF, TJ operator strange encoding

Question

I'm currently trying to extract text from a PDF document, but I encountered some strange cases with the Tj operator. Normally I dealt with cases like these:

   Tc (SOME_TEXT) TJ

Now I encounter a case like this:

Which converts to string '52249.64'. Now I have encountered yet another strange case:

The only info I could find is this: A string passed to Tj is always to be interpreted according to the Encoding or CMap for the font. (In this case I expect it is a CIDFont with a CMap)

Td  (
        \t\004\007\020\007\016\016\026\020
    )
Tj

I still don't understand. Are these some kind of indexes that indicate an offset in some kind of character array or do I have to decode these values? Thanks!

There's nothing strange about that. You know that there's a document (http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf) where all this is explained? — Paulo Soares, Oct 29 '15 at 11:50
@PauloSoares The link is dead. Do you remember a google-friendly title of the document? — Hermann, Mar 22 '23 at 13:04
@Hermann *"The link is dead. "* - You can retrieve the current PDF spec as described in [this answer](https://stackoverflow.com/a/75950220/1729265). — mkl, Jun 16 '23 at 08:50

mkl · Accepted Answer · 2023-06-16T08:41:46.433

As @Paulo already indicated in his comment, you should first consult the PDF specification, i.e. currently ISO 32000-1, a free copy of which is provided by Adobe here.

On the topic of text extraction, see in particular section 9.10, "Extraction of Text Content", and especially:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

a) Map the character code to a character identifier (CID) according to the font’s CMap.

b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(If some of the terms here are unknown to you, read about them in ISO 32000-1 or the other specifications referenced there.)

For an acceptable text extraction result, therefore, make your text extractor support the methods presented in that section.
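As a minimal sketch of the first method only (the ToUnicode CMap lookup), the Python below resolves the escape sequences of a literal string operand into raw character codes and then looks each code up in a table. The function names `literal_string_to_codes` and `decode_codes`, the `to_unicode` dictionary (assumed to have been parsed from the font's ToUnicode CMap already), and the fixed `code_length` are all assumptions made for illustration. A real extractor would take the code lengths from the CMap's codespace ranges and fall back to the other methods when no ToUnicode CMap exists.

```python
# Single-character escapes allowed in PDF literal strings (ISO 32000-1, 7.3.4.2).
_SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
    ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}


def literal_string_to_codes(body: bytes) -> bytes:
    """Resolve escapes in the bytes found between '(' and ')' of a literal string."""
    out = bytearray()
    i, n = 0, len(body)
    while i < n:
        b = body[i]
        if b != 0x5C:                      # ordinary byte, copy as-is
            out.append(b)
            i += 1
            continue
        i += 1                             # skip the backslash
        if i >= n:
            break
        nxt = body[i]
        if 0x30 <= nxt <= 0x37:            # \ddd: one to three octal digits
            j = i
            while j < n and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j].decode("ascii"), 8) & 0xFF)
            i = j
        elif nxt in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[nxt])
            i += 1
        elif nxt in (0x0D, 0x0A):          # backslash + end of line: continuation
            i += 1
            if nxt == 0x0D and i < n and body[i] == 0x0A:
                i += 1
        else:                              # unknown escape: drop the backslash
            out.append(nxt)
            i += 1
    return bytes(out)


def decode_codes(codes: bytes, to_unicode: dict, code_length: int = 1) -> str:
    """Map character codes to text via a hypothetical, pre-parsed ToUnicode table.

    code_length is a simplification: 1 for simple fonts, often 2 for
    composite fonts using Identity-H. Unmapped codes become U+FFFD.
    """
    chars = []
    for i in range(0, len(codes), code_length):
        code = int.from_bytes(codes[i:i + code_length], "big")
        chars.append(to_unicode.get(code, "\ufffd"))
    return "".join(chars)


# The escape sequences from the question, without the surrounding whitespace.
codes = literal_string_to_codes(rb"\t\004\007\020\007\016\016\026\020")
print(list(codes))  # [9, 4, 7, 16, 7, 14, 14, 22, 16]
```

So the escapes in the question are just octal notations for the character codes 9, 4, 7, 16, 7, 14, 14, 22, 16. Which characters those codes stand for depends entirely on the font's ToUnicode CMap or encoding, which is why the table has to come from the font in question and cannot be guessed from the content stream alone.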

Yes, I get that 503, too. On the other hand the Adobe.com/go URLs are meant to be more stable. Oh well, as ISO 32000-2 is now available for free, maybe they don't care. — mkl, Jun 17 '23 at 08:46
