
I have a PDF whose content stream looks like image1.

[image1: content stream of the original PDF]

But once I opened the PDF in Adobe Acrobat DC and changed the reading order, the entire content stream changed (please see image2).

[image2: content stream after changing the reading order in Adobe Acrobat DC]

Here is the link to the source PDF: https://drive.google.com/file/d/1V2K3-2GdWG5DuTUv1fyfIIT54en70kI2/view

Is there a way to do the same programmatically (i.e. convert the content stream of graphical text into a proper stream)?

Thanks in advance!

General Grievance
SuperNova
  • I'm not sure I understand your question. I assume you want to change the reading order; "proper content stream" doesn't mean much. What you could do is use the `WriteDecodedDoc` command line utility, then open your file with Notepad++ and change the reading order by switching these blocks. Take care not to insert or delete anything, i.e. the stream start and end positions must stay the same. Then open the file with Adobe Reader and save it so that it gets compressed. – Tilman Hausherr Jul 25 '19 at 15:08

2 Answers


Is there a way to do the same programmatically (i.e. convert the content stream of graphical text into a proper stream)?

First of all, both streams are proper. There merely are different (and in the case at hand considerably different) ways to create the same text on screen, each as valid as the other, and different PDF processors use different ways.

The processor that created your original PDF appears to have approached the task by dividing the text into small pieces (less than a text line) and drawing these pieces as independently as possible, i.e. as separate text objects (BT..ET) with text properties set in each (Tm, Tf, Tc), positioned and rescaled by transformation matrix changes (cm), and enveloped in save/restore graphics state instructions (q..Q).

Adobe Acrobat, on the other hand, appears to prefer the page main text to be contained in a single text object with text properties only set when they change and no text object or graphics state switches in-between.

Neither of these is more "proper" or more "graphical" than the other. If anything, these structures mirror how these instructions are stored or processed internally by the respective PDF processor.

That being said, you do want to convert from the former style into the latter.

The main problem is that the latter style is not standardized (at least there is no published document normatively describing it). So, while you can surely attempt to follow the lead of the example you have, you can never be sure that you understood the style exactly. Thus, you always have to expect differences emerging in special, not yet encountered situations. Furthermore, there is no guarantee Adobe will meticulously adhere to that style across software versions.

Nonetheless, you can of course attempt to follow the style (as you perceive it) as well as possible.

An implementation will have to walk through the respective content stream, keeping track of the current graphics state, and transform the text drawing (and related) instructions into a single text object for as long as possible.
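As a rough illustration of what "walking through the content stream while keeping track of the graphics state" involves, here is a minimal, library-free Python sketch. It is not how iText or PDFBox do it; the names `TOKEN`, `mul` and `text_runs` are made up for this example, and the tokenizer is deliberately simplified (no nested parentheses, hex strings, comments or inline images, and only the q, Q, cm, BT, Tm, Tf, Tj and TJ operators are interpreted). It reports where each text-showing instruction starts in page coordinates, which is the information needed before adjacent pieces can be merged into one text object.

```python
import re

# Simplified tokenizer (an assumption of this sketch, not a full PDF lexer).
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)          # literal string without nested parentheses
  | \[ | \]                       # array delimiters
  | /[^\s\[\]()/<>]+              # name, e.g. /F4.0
  | [+-]?(?:\d+\.?\d*|\.\d+)      # number
  | [A-Za-z'"*]+                  # operator, e.g. q, cm, BT, Tm, TJ
""", re.VERBOSE)

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)

def mul(m, n):
    """Multiply two PDF matrices given as (a, b, c, d, e, f): returns m x n."""
    a, b, c, d, e, f = m
    A, B, C, D, E, F = n
    return (a*A + b*C, a*B + b*D, c*A + d*C, c*B + d*D,
            e*A + f*C + E, e*B + f*D + F)

def text_runs(stream):
    """Return (text, x, y, effective_size) for every Tj/TJ in the stream."""
    ctm, saved = IDENTITY, []        # current transformation matrix and q/Q stack
    tm, font_size = IDENTITY, 0.0    # text matrix and Tf size
    operands, runs = [], []
    for tok in TOKEN.findall(stream):
        if not (tok[0].isalpha() or tok in ("'", '"')):
            operands.append(tok)     # operand: collect until an operator arrives
            continue
        if tok == "q":
            saved.append(ctm)
        elif tok == "Q" and saved:
            ctm = saved.pop()
        elif tok == "cm":
            ctm = mul(tuple(float(v) for v in operands[-6:]), ctm)
        elif tok == "BT":
            tm = IDENTITY
        elif tok == "Tm":
            tm = tuple(float(v) for v in operands[-6:])
        elif tok == "Tf":
            font_size = float(operands[-1])
        elif tok in ("Tj", "TJ"):
            # Raw string contents are joined as-is; escapes are not decoded.
            text = "".join(s[1:-1] for s in operands if s.startswith("("))
            trm = mul(tm, ctm)       # text space -> page space
            runs.append((text, trm[4], trm[5], font_size * trm[3]))
        operands = []
    return runs
```

Feeding it two of the small blocks from a stream in the original style yields one entry per block, each with its own start position; a converter would then compare those positions to decide which pieces can follow one another inside a single BT..ET.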

You have tagged your question both itext and pdfbox, so you appear to be undecided about which PDF library to use for this. Here are some ideas for both choices:

  • For processing content streams and keeping track of the current graphics state, iText offers its com.itextpdf.text.pdf.parser API, in particular the PdfContentStreamProcessor (iText 5.x), or its com.itextpdf.kernel.pdf.canvas.parser API, in particular the PdfCanvasProcessor (iText 7.x).

    You can extend them so that, in addition to analyzing the current contents, they also replace the content stream in question with an updated version, e.g. like I did in this answer for iText 5 or in this answer for iText 7.

  • PDFBox offers the class hierarchy based on the PDFStreamEngine for the same task. Based on these classes it should similarly be possible to create a graphics state aware content stream editor.

Both libraries also offer simpler classes for parsing the content streams into sequences of instructions, but those classes don't keep track of the graphics state, leaving that for you to implement.

mkl

The question is unclear in that, as described by @mkl, both streams hold the same content; they are just different ways of setting, and thus viewing, the lines of text.

The confusion is caused by the way lines of text are broken into sub-units, so that a "line" is often stored as several text "blocks", with what look like separated characters inside each block (the small numbers between the strings are position adjustments, commonly called "kerning").

In this case that kerning is generally not needed, so let me show you an alternative way of writing those lines. Here is the first "paragraph"; note that a PDF has no such distinction, and each line is in effect "standalone" (or several standalones :-).

NOTE: these are the first FIVE lines of text, placed as NINE blocks (two of them are simply a space character on the same line).

q 0.24 0 0 0.24 50.05992 729.02 cm BT 0.0029 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (be) 1 ( ) 2 (t) 2 (r) 1 (ue) 1 (.) 2 ( ) 2 ( ) 2 (C) -1 (e) 1 (r) 1 (t) 2 (a) 1 (i) 2 (nl) 2 (y ) 2 (t) 2 (he) 1 (r) 1 (e) 1 ( ) 2 (a) 1 (r) 1 (e) 1 ( ) 2 (c) 1 (r) 1 (i) 2 (t) 2 (i) 2 (c) 1 (a) 1 (l) 2 ( ) 2 (a) 1 (s) 1 (pe) 1 (c) 1 (t) 2 (s) 1 ( ) 2 (of) 1 ( ) 2 (t) 2 (he) 1 ( ) 2 (t) 2 (he) 1 (or) 1 (y ) 2 (l) 2 (e) 1 (f) 1 (t) 2 ( ) 2 (unt) 2 (e) ] TJ ET Q 
q 0.24 0 0 0.24 414.2992 729.02 cm BT 0.0024 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (st) 1 (ed) -1 ( ) 1 (i) 1 (n) -1 ( ) 1 (o) -1 (u) -1 (r) 1 ( ) 1 (st) 1 (u) -1 (d) -1 (y) -1 (.) 1 ( ) 1 ( ) 1 (W) -3  (h) -1 (er) 1 (e,) 1 ( ) ] TJ ET Q 
q 0.24 0 0 0.24 50.05992 711.02 cm BT 0.0024 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (fo) -1 (r) 1 ( ) 1 (ex) -1 (am) -2 (p) -1 (l) 1 (e,) 1 ( ) 1 (i) 1 (s ) 1 (t) 1 (h) -1 (er) 1 (e ) 1 (an) -1 (y) -1 ( ) 1 (d) -1 (i) 1 (r) 1 (ect) 1 ( ) 1 (ev) -1 (i) 1 (d) -1 (en) -1 (ce ) 1 (o) -1 (f ) 1 (Òau) -1 (r) 1 (asÓ?) 2 ( ) 1 ( ) 1 (A) -2 (l) 1 (t) 1 (h) -1 (o) -1 (u) -1 (g) -1 (h) -1 ( ) 1 (t) 1 (h) -1 (ey) -1 ( ) 1 (ar) 1 (e ) 1 (t) 1 (h) -1 (eo) -1 (r) 1 (i) 1 (zed) -1 ( ) 1 (t) 1 (o) -1 ( ) 1 (b) -1 (e ) 1 (t) 1 (h) -1 (e ) ] TJ ET Q 
q 0.24 0 0 0.24 50.05992 693.02 cm BT 0.0024 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (cr) 1 (i) 1 (t) 1 (i) 1 (cal) 1 ( ) 1 (l) 1 (i) 1 (n) -1 (k) -1 ( ) 1 (b) -1 (et) 1 (w) -2 (een) -1 ( ) 1 (t) 1 (h) -1 (e ) 1 (ci) 1 (g) -1 (ar) 1 (et) 1 (t) 1 (e ) 1 (an) -1 (d) -1 ( ) 1 (ar) 1 (m) -2 ( ) 1 (st) 1 (r) 1 (en) -1 (g) -1 (t) 1 (h) -1 (,) 1 ( ) 1 (w) -2 (e ) 1 (h) -1 (av) -1 (e ) 1 (o) -1 (b) -1 (t) 1 (ai) 1 (n) -1 (ed) -1 ( ) 1 (n) -1 (o) -1 ( ) 1 (ev) -1 (i) 1 (d) -1 (en) -1 (ce ) 1 (t) 1 (h) -1 (at) 1 ( ) ] TJ ET Q 
q 0.24 0 0 0.24 50.05992 675.02 cm BT 0.0024 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (Òn) -1 (eg) -1 (at) 1 (i) 1 (v) -1 (el) 1 (y) -1 ( ) 1 (i) 1 (n) -1 (t) 1 (er) 1 (act) 1 (i) 1 (n) -1 (g) -1 ( ) 1 (au) -1 (r) 1 (asÓ ) 1 (ex) -1 (i) 1 (st) 1 ( ) 1 (o) -1 (r) 1 ( ) 1 (p) -1 (l) 1 (ay) -1 ( ) 1 (an) -1 (y) ] TJ ET Q 
q 0.24 0 0 0.24 313.6195 675.02 cm BT 58 0 0 58 0 0 Tm /F4.0 1 Tf ( ) Tj ET Q 
q 0.24 0 0 0.24 317.1195 675.02 cm BT 0.0019 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (ro) -1 (l) 1 (e) -1 ( ) 1 (i) 1 (n) -1 ( ) 1 (t) 1 (h) -1 (e) -1 ( ) 1 (p) -1 (ro) -1 (c) -1 (e) -1 (s) -1 (s) -1 (.) ] TJ ET Q 
q 0.24 0 0 0.24 422.8646 675.02 cm BT 58 0 0 58 0 0 Tm /F4.0 1 Tf ( ) Tj ET Q 
q 0.24 0 0 0.24 100.0599 657.02 cm BT 0.0053 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (Mo) 2 (s) 3 (t) 4 ( ) 4 (c) 3 (r) 3 (i) 4 (t) 4 (i) 4 (c) 3 (a) 3 (l) 4 (l) 4 (y) 2 (,) 4 ( ) 4 (h) 2 (o) 2 (w) 1 (e) 3 (v) 2 (e) 3 (r) 3 (,) 4 ( ) 4 (w) 1 (e) 3 ( ) 4 (n) 2 (e) 3 (e) 3 (d) 2 ( ) 4 (t) 4 (o) 2 ( ) 4 (r) 3 (e) 3 (c) 3 (a) 3 (l) 4 (l) 4 ( ) 4 (t) 4 (h) 2 (e) 3 ( ) 4 (p) 2 (r) 3 (o) 2 (p) 2 (o) 2 (s) 3 (i) 4 (t) 4 (i) 4 (o) 2 (n) 2 (a) 3 (l) 4 ( ) 4 (r) 3 (e) 3 (a) 3 (s) 3 (o) 2 (n) 2 (i) 4 (n) 2 (g) 2 ( ) 4 (t) 4 (h) 2 (a) 3 (t) 4 ( ) 4 (f) 3 (o) 2 (r) 3 (m) 1 (s) 3 ( ) ] TJ ET Q


Here we see the 9 content lines in an editor on the left, and on the right I have highlighted the 8th line, which sits at the end of the 4th visible line of output.

We can simplify that down to 5 lines (without kerning), in a format like this:

q 0.24 0 0 0.24 50.05992 729.02 cm BT 0.0029 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (be true.  Certainly there are critical aspects of the theory left untested in our study.  Where, ) ] TJ ET Q
q 0.24 0 0 0.24 50.05992 711.02 cm BT 0.0024 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (for example, is there any direct evidence of ÒaurasÓ?  Although they are theorized to be the ) ] TJ ET Q
q 0.24 0 0 0.24 50.05992 693.02 cm BT 0.0024 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (critical link between the cigarette and arm strength, we have obtained no evidence that ) ] TJ ET Q
q 0.24 0 0 0.24 50.05992 675.02 cm BT 0.0024 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (Ònegatively interacting aurasÓ exist or play any role in the process. ) ] TJ ET Q 
q 0.24 0 0 0.24 100.0599 657.02 cm  BT 0.0053 Tc 58 0 0 58 0 0 Tm /F4.0 1 Tf [ (Most critically, however, we need to recall the propositional reasoning that forms ) ] TJ ET Q

So what would be the effect? In this case there would be no problem. However, let me show you how this minimally alters the content for a large reduction in file size.

Here the first line and a half have been replaced by the condensed version, and the second half (in red) is the old text underlaid with the new.
There is no real discernible difference in final placement. The 3rd and 4th lines are likewise old on top of new, and again the difference is so small that I defy a "person on a galloping horse", or an average reader, to see the displaced characters in the last words "that" and "process".
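To put a number on how small these shifts are: the numbers inside a TJ array are expressed in thousandths of a text space unit and are subtracted from the glyph advance. In the stream above the font size set by Tf is 1, Tm scales by 58 and cm scales by 0.24, so one unit works out to roughly 58 × 0.24 / 1000 ≈ 0.014 pt. The following Python sketch adds this up for one TJ array. The function name `tj_adjustment_advance` and its default parameters are assumptions made for this example, and it assumes plain literal strings and 100% horizontal scaling.

```python
import re

def tj_adjustment_advance(tj_array, font_size=1.0, tm_scale=58.0, cm_scale=0.24):
    """Net horizontal advance (in points) contributed by the numbers in a TJ array.

    Defaults mirror the stream in this answer: Tf size 1, Tm scale 58, cm scale 0.24.
    """
    # Drop the strings first so that digits inside the text are not counted.
    without_strings = re.sub(r"\((?:\\.|[^\\()])*\)", " ", tj_array)
    adjustments = [float(n) for n in
                   re.findall(r"[+-]?(?:\d+\.?\d*|\.\d+)", without_strings)]
    # Each adjustment n moves the following glyph by -n/1000 text space units.
    return -sum(adjustments) / 1000.0 * font_size * tm_scale * cm_scale

# Array taken from one of the nine blocks shown earlier in this answer.
shift = tj_adjustment_advance(
    "[ (ro) -1 (l) 1 (e) -1 ( ) 1 (i) 1 (n) -1 ( ) 1 (t) 1 (h) -1 (e) -1 "
    "( ) 1 (p) -1 (ro) -1 (c) -1 (e) -1 (s) -1 (s) -1 (.) ]"
)
print(f"end of run moves by {-shift:+.4f} pt if the adjustments are dropped")
```

Because the adjustments partly cancel each other, the end of a run typically moves by a small fraction of a point, which is consistent with the overlay showing no visible difference on most lines.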


However, with the 5th line we can see a difference, but does it matter? Let me remove the old bloated one. Would you really know it was the shorter one that went? And the file size is 2084 bytes shorter as a result of cleaning up three and a half lines!


Answer

Simply removing the kerning entries, i.e. anything between ") -4 (" and ") 4 (", will in most cases reduce file size significantly without too much degradation of appearance. However, you need to check cases where the line width is impacted by removing those tweaking twips.
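A minimal Python sketch of that clean-up on an already decoded content stream could look like the following. The function name `strip_small_kerning` and the `limit` parameter are made up for this example. It assumes the TJ arrays contain only literal strings without nested parentheses plus numbers; arrays with hex strings are left untouched, and larger adjustments (such as real word gaps) are kept.

```python
import re

STRING = r"\((?:\\.|[^\\()])*\)"          # literal string, no nested parentheses
NUMBER = r"[+-]?(?:\d+\.?\d*|\.\d+)"
# A TJ array made only of strings and numbers, followed by the TJ operator.
TJ_ARRAY = re.compile(
    r"\[((?:\s*(?:" + STRING + "|" + NUMBER + r"(?=[\s(\]])))*)\s*\]\s*TJ"
)

def strip_small_kerning(stream, limit=4):
    """Drop TJ adjustments with |value| <= limit and join the neighbouring strings."""
    def merge(match):
        out = []
        for part in re.findall(STRING + "|" + NUMBER, match.group(1)):
            if part.startswith("("):
                if out and out[-1].startswith("("):
                    # "(ab)" followed by "(cd)" becomes "(abcd)"
                    out[-1] = out[-1][:-1] + part[1:]
                else:
                    out.append(part)
            elif abs(float(part)) > limit:
                out.append(part)          # keep large adjustments untouched
        return "[ " + " ".join(out) + " ] TJ"
    return TJ_ARRAY.sub(merge, stream)

before = "BT /F4.0 1 Tf [ (ro) -1 (l) 1 (e) -1 ( ) 1 (i) 1 (n) ] TJ ET"
print(strip_small_kerning(before))
# prints: BT /F4.0 1 Tf [ (role in) ] TJ ET
```

This only rewrites the text of the stream; writing it back into the PDF (and updating the stream length) still has to be done with a PDF library or by re-saving the file as described in the comment above.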

K J