
How is the text processed while converting a .doc file to a .pdf file? I tried to intercept the "Tj" operator using PDFBox. The sentence "interchange features of PDF. Again, the resulting PDF file can be viewed with a viewer application, such as " is broken into

"interchange features of PDF. Agai" and "n, the resulting PDF file can be viewed with a viewer application, such as ". The arguments to the TJ operator were

[COSArray{[COSString{in}, COSInt{5}, COSString{t}, COSInt{5}, COSString{er}, COSInt{-4}, COSString{ch}, COSInt{5}, COSString{an}, COSInt{4}, COSString{g}, COSInt{5}, COSString{e }, COSInt{-2}, COSString{f}, COSInt{10}, COSString{eat}, COSInt{5}, COSString{ur}, COSInt{10}, COSString{es o}, COSInt{6}, COSString{f }, COSInt{-2}, COSString{P}, COSInt{6}, COSString{DF}, COSInt{6}, COSString{.}, COSInt{13}, COSString{ Ag}, COSInt{3}, COSString{ai}]}] and 

[COSArray{[COSString{n, t}, COSInt{6}, COSString{he }, COSInt{10}, COSString{r}, COSInt{-2}, COSString{esu}, COSInt{5}, COSString{lt}, COSInt{8}, COSString{in}, COSInt{5}, COSString{g}, COSInt{5}, COSString{ P}, COSInt{4}, COSString{DF}, COSInt{6}, COSString{ f}, COSInt{-2}, COSString{il}, COSInt{5}, COSString{e }, COSInt{8}, COSString{ca}, COSInt{4}, COSString{n b}, COSInt{3}, COSString{e }, COSInt{8}, COSString{view}, COSInt{9}, COSString{ed wit}, COSInt{6}, COSString{h a}, COSInt{14}, COSString{ v}, COSInt{-3}, COSString{ie}, COSInt{12}, COSString{we}, COSInt{8}, COSString{r}, COSInt{8}, COSString{ app}, COSInt{5}, COSString{li}, COSInt{5}, COSString{ca}, COSInt{4}, COSString{t}, COSInt{5}, COSString{io}, COSInt{7}, COSString{n, s}, COSInt{6}, COSString{uc}, COSInt{5}, COSString{h as}, COSInt{7}, COSString{ }]}]

Is this because of the way a .doc is converted into a PDF, or is it because of the text blocks referred to in the last answer of this question? What is the significance of the COSInt values between the COSString values? I don't really understand text blocks, but I don't think there should be a problem if I try to intercept the Tj operator. Would it be the same if I processed a PDF created from another PDF file?

programer8

1 Answer


First of all: it's not correct to state "a .doc file gets converted to a PDF". It is not a conversion of any kind; rather, the document is rendered to a virtual printer, and the virtual printer writes out PDF text commands that form the pages. The order in which objects (text and graphics) appear inside a PDF is not determined by the contents of the original document; the virtual printer may process the objects in any order.

Don't mix up TJ and Tj. Per Adobe's PDF Reference 1.7:

5.3.2 Text-Showing Operators ...

string Tj Show a text string.

array TJ Show one or more text strings, allowing individual glyph positioning. [...] The number is expressed in thousandths of a unit of text space.

Tj shows a continuous text string; for TJ, the COSInts in between are horizontal adjustments between the individual text strings. However, that does not imply that everything drawn with Tj was a single text string to begin with. The PDF generator may split up a single longer sentence into separate Tj instructions; for instance, to group texts of the same font and size together.
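To put a number on "thousandths of a unit of text space": the PDF Reference gives the horizontal displacement as tx = ((w0 - Tj/1000) x Tfs + Tc + Tw) x Th, so a single number in a TJ array contributes -Tj/1000 x Tfs x Th. Here is a minimal sketch in plain Java (no PDFBox). The class and method names are made up for illustration, and the 12 pt font size is an assumption, because the question does not state it.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class TjAdjustment {
    /**
     * Horizontal shift caused by one number in a TJ array, in unscaled text space units.
     * A positive number moves the next glyph to the left (tighter); a negative number
     * moves it to the right (wider).
     */
    static double shift(double adjustment, double fontSize, double horizontalScale) {
        return -adjustment / 1000.0 * fontSize * horizontalScale;
    }

    public static void main(String[] args) {
        double fontSize = 12.0; // assumed value; the question does not state the font size
        for (int adjustment : new int[] {5, -4, 13}) {
            System.out.printf("%d -> %.3f units%n", adjustment, shift(adjustment, fontSize, 1.0));
        }
    }
}
```

Under that assumption, every adjustment in the question (they all lie between -4 and 14) moves a glyph by less than a fifth of a unit, which points to kerning and not to word gaps.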

Similarly, a TJ array may contain only very small adjustments between separate text fragments, to implement character level kerning or tracking; but it also may contain larger distances to create custom spaces, mimic tabs, or overprint characters.
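As a rough sketch of how one might turn a TJ array back into a string: concatenate the string elements, and treat only a strongly negative number (a wide gap to the right) as a possible space. The TJ array is modelled here as a plain List of String and Number values, not as PDFBox objects. The class name and the threshold of 200 are my own assumptions; a usable threshold depends on the font.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class TjJoiner {
    /** Joins the strings; an adjustment below -gapThreshold is treated as a word gap. */
    static String join(List<?> tjArray, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        for (Object item : tjArray) {
            if (item instanceof String) {
                out.append((String) item);
            } else if (item instanceof Number) {
                boolean wideGap = ((Number) item).doubleValue() < -gapThreshold;
                boolean needsSpace = out.length() > 0 && out.charAt(out.length() - 1) != ' ';
                if (wideGap && needsSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Start of the first array from the question: kerning only, so the numbers are ignored.
        List<Object> kerned =
                Arrays.<Object>asList("in", 5, "t", 5, "er", -4, "ch", 5, "an", 4, "g", 5, "e ");
        System.out.println("[" + join(kerned, 200) + "]");

        // Invented example: a wide negative adjustment standing in for a space.
        List<Object> gapped = Arrays.<Object>asList("Hello", -300, "World");
        System.out.println("[" + join(gapped, 200) + "]");
    }
}
```

This deliberately ignores overprinting and tab-like jumps; those cases need real glyph widths and positions, not just the numbers in the array.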

The "text blocks" you refer to are string operands:

A string operand of a text-showing operator is interpreted as a sequence of character codes identifying the glyphs to be painted.

..

Strings presented to the text-showing operators may be of any length—even a single character code per string—and may be placed on the page in any order. The grouping of glyphs into strings has no significance for the display of text. Showing multiple glyphs with one invocation of a text-showing operator such as Tj produces the same results as showing them with a separate invocation for each glyph.

A possible problem is the positioning of the TJ/Tj strings. Usually, a text gets rendered in reading order: left to right, top to bottom. But items such as headers and footers, and figures or tables, may always get rendered first or last. In addition, if the text fragments are rendered per font/size, you might find (for example) all of the roman text first, then all italics text, and finally all bold text.
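If the position of every fragment is known, a common workaround is to ignore the order in the content stream and sort by position instead. The sketch below assumes a hypothetical Fragment type carrying the baseline origin in page space; the names and the tolerance are invented for illustration. It sorts top to bottom, then left to right, and groups baselines by rounding y.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical types for illustration; the positions would come from whatever parser is used.
final class ReadingOrder {
    static final class Fragment {
        final double x, y; // baseline origin in page space (y grows upward in PDF)
        final String text;
        Fragment(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    /** Sorts top to bottom, then left to right; y is rounded to multiples of lineTolerance. */
    static List<String> sort(List<Fragment> fragments, double lineTolerance) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator
                .comparingLong((Fragment f) -> -Math.round(f.y / lineTolerance))
                .thenComparingDouble(f -> f.x));
        List<String> out = new ArrayList<>();
        for (Fragment f : copy) {
            out.add(f.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented content-stream order: italics first, then the footer, then the roman text.
        List<Fragment> fragments = Arrays.asList(
                new Fragment(100, 700.0, "italic"),
                new Fragment(50, 30.0, "Footer"),
                new Fragment(50, 700.2, "Roman "),
                new Fragment(50, 686.0, "next line"));
        System.out.println(sort(fragments, 2.0));
    }
}
```

Rounding is a crude way to group baselines: two fragments on the same line can still land in different buckets when their y values straddle a bucket boundary, and multi-column layouts need more than a simple sort.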

In most cases it is impossible to accurately extract the original text back out of a PDF. Both TJ and Tj [a] only lay out single spans of text (usually horizontal, although they can render vertical text as well), and the original relation between text spans is not retained, because the virtual printer was never aware of it to begin with.

[a] There are two more text-showing operators: ' and " show a string just like Tj, but first move the 'current point' to "the start of the next line" (the " operator additionally sets the word and character spacing). That, in turn, requires interpreting the values of "leading" and "start of the current line".
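To show why "leading" and "start of the current line" matter, here is a stripped-down text state in plain Java. It only tracks the line start and the leading, and it ignores the text matrix, fonts and scaling. The class and method names are made up; the operator semantics follow the PDF Reference (T* behaves like "0 -leading Td", and ' behaves like T* followed by Tj).

```java
import java.util.Locale;

// Hypothetical, heavily simplified text state: no text matrix scaling, no fonts.
final class LineState {
    double lineX, lineY; // start of the current line
    double leading;      // set by the TL operator

    void td(double tx, double ty) { lineX += tx; lineY += ty; } // "tx ty Td"
    void tl(double value) { leading = value; }                  // "value TL"
    void tStar() { td(0, -leading); }                           // "T*"

    /** The ' operator: move to the start of the next line, then show the string like Tj. */
    String quote(String s) {
        tStar();
        return String.format(Locale.ROOT, "%.1f,%.1f:%s", lineX, lineY, s);
    }

    public static void main(String[] args) {
        LineState state = new LineState();
        state.td(72, 700); // invented start position
        state.tl(14);      // invented leading
        System.out.println(state.quote("first"));
        System.out.println(state.quote("second"));
    }
}
```

An interceptor that only looks at the string operand of ' never sees this implicit line change, which is one more way for text spans to appear in an unexpected place.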

Another caveat is that the character encoding in the text operands may not be what you expected. A PDF printer is free to reorder or change the character encoding, such as when a font has been subsetted, or to access special characters outside the default font encoding. So you might get a string back as

[ (\251 1985\205) 6.4 (2006 A) 24 (d) 1 (o) 9.7 (b) -12.3 (e) ] TJ

(first line of page 2 of the PDF Reference 1.7). The octal characters \251 (169 in decimal) and \205 (133 in decimal) are the characters © and – (an en dash); the first is also a regular ISO-Latin1 code, but the second is not -- this text is in PDFDocEncoding (Appendix D, Character Sets and Encodings). Encoding may differ from font to font in your document (and it's also possible you have duplicates of a font, with different encodings). The encoding may also be totally custom (using \000 for 'A', \001 for 'd', and so on) or stored as the difference with one of the standard encodings:

7 0 obj @ 319814        % Encoding
<<
  /Type         /Encoding
  /Differences  [ 32 /space 38 /ampersand 44 /comma /hyphen /period /slash /zero /one /two /three 53 /five /six /seven /eight /nine /colon /semicolon 65 /A /B /C /D
      /E /F /G /H /I 75 /K /L /M /N /O /P 82 /R /S /T /U /V /W /X 90 /Z 95 /underscore 97 /a
      /b /c /d /e /f /g /h /i /j /k /l /m /n /o /p /q /r /s /t /u /v /w /x /y /z 133
      /endash 141 /quotedblleft /quotedblright 169 /copyright ]
>>
endobj
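A minimal sketch of how such a /Differences array can be applied: a number sets the current code, and each following name is assigned to consecutive codes. The decoder below is plain Java with invented names. It only handles three-digit octal escapes, it uses a shortened copy of the array above, and its glyph-name table is a tiny hand-made excerpt (a real decoder would use the full glyph list and fall back to the base encoding).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical decoder for illustration; real code would read the encoding from the font dictionary.
final class DifferencesDecoder {
    /** Parses the body of a /Differences array: a code, then names for consecutive codes. */
    static Map<Integer, String> parse(String differences) {
        Map<Integer, String> names = new HashMap<>();
        int code = 0;
        for (String token : differences.trim().split("\\s+")) {
            if (token.startsWith("/")) {
                names.put(code++, token.substring(1));
            } else {
                code = Integer.parseInt(token);
            }
        }
        return names;
    }

    /** Resolves three-digit \ddd octal escapes in a PDF literal string into character codes. */
    static int[] codes(String literal) {
        int[] buf = new int[literal.length()];
        int n = 0;
        for (int i = 0; i < literal.length(); i++) {
            if (literal.charAt(i) == '\\' && i + 3 < literal.length()
                    && literal.substring(i + 1, i + 4).matches("[0-7]{3}")) {
                buf[n++] = Integer.parseInt(literal.substring(i + 1, i + 4), 8);
                i += 3;
            } else {
                buf[n++] = literal.charAt(i);
            }
        }
        return Arrays.copyOf(buf, n);
    }

    /** Maps codes to glyph names, then names to characters; unknown codes become '?'. */
    static String decode(int[] codes, Map<Integer, String> names, Map<String, Character> glyphs) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String name = names.get(code);
            Character ch = (name == null) ? null : glyphs.get(name);
            if (ch != null) {
                out.append(ch.charValue());
            } else {
                out.append('?');
            }
        }
        return out.toString();
    }

    /** Decodes the first string of the TJ example with a shortened Differences array. */
    static String demo() {
        Map<Integer, String> names =
                parse("32 /space 49 /one 53 /five /six /seven /eight /nine 133 /endash 169 /copyright");
        Map<String, Character> glyphs = new HashMap<>(); // hand-made excerpt, not a full glyph list
        glyphs.put("space", ' ');
        glyphs.put("one", '1');
        glyphs.put("five", '5');
        glyphs.put("eight", '8');
        glyphs.put("nine", '9');
        glyphs.put("endash", '\u2013');
        glyphs.put("copyright", '\u00A9');
        return decode(codes("\\251 1985\\205"), names, glyphs);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the two-step lookup (code to glyph name, glyph name to character) is that the byte values in the content stream mean nothing on their own; the same byte can map to different glyphs in two fonts of the same document.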

Addition

The PDF Reference 1.7 in itself is an interesting target. Inspecting the text on a chapter start page, page 25 ("Chapter 1 - Introduction"), I found this:

25
CHAPTER 1
1Introduction
The Adobe Portable Document Format (PDF) is the native file format of the ..

The "25" is the page number at the bottom, and "CHAPTER 1" is obvious; but why "1Introduction"? Was that a decoding error? Further inspection showed the "1" is set at 1.98 pt size and with a fill color of "White" (it actually showed up when I placed a black rectangle behind the entire page). I guess this was just one of the typesetter's tricks: by including the chapter number on the same line, he could make his software (FrameMaker) automatically generate the correct "Bookmark" text from that line, including the '1'. Of course, the '1' should not be visible on the page itself, so he set it small and white.
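If you want to filter out such hidden text during extraction, one heuristic is to drop runs that are tiny or filled with white. The sketch below is plain Java with invented names; the 24 pt size for the heading and the 3 pt threshold are assumptions, and the white test only makes sense on a white page (white text on a dark background is perfectly visible).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration; size and color would come from the graphics state.
final class HiddenTextFilter {
    static final class Run {
        final String text;
        final double fontSize;
        final double r, g, b; // fill color as RGB components in the range 0..1
        Run(String text, double fontSize, double r, double g, double b) {
            this.text = text; this.fontSize = fontSize; this.r = r; this.g = g; this.b = b;
        }
    }

    /** Heuristic: a run is probably invisible when it is tiny or filled with (near) white. */
    static boolean likelyInvisible(Run run, double minSize) {
        boolean tiny = run.fontSize < minSize;
        boolean white = run.r >= 0.99 && run.g >= 0.99 && run.b >= 0.99;
        return tiny || white;
    }

    /** Concatenates the text of all runs that pass the visibility heuristic. */
    static String visibleText(List<Run> runs, double minSize) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            if (!likelyInvisible(run, minSize)) {
                out.append(run.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("1", 1.98, 1, 1, 1),             // the hidden chapter number
                new Run("Introduction", 24.0, 0, 0, 0)); // heading; 24 pt is an assumed size
        System.out.println(visibleText(runs, 3.0));
    }
}
```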

Jongware