
Background

According to the PDF standard, an updated file can contain several versions. Figure 3 in section 7.5.6 shows an original body plus metadata (cross-reference section, trailer), followed by multiple updates.

Problem

Is there any library that allows extracting these revisions? Is this possible with poppler (I am using poppler anyway for most of the work)?

An API that reports the number of revisions and allows extracting them would be nice. If I understand the standard correctly, this is simply a matter of cutting off the updates part and adding an updated startxref at the end of the file.

Note

While this seems easy to roll myself, I would prefer to reuse something existing before I resort to writing my own.

@CloseVotes: While I could have phrased the question as How to extract PDF revisions with Poppler, I wanted to keep it broad since I prefer using an additional library over hacking my own.

ted
  • **A** Transitions between revisions are not always as clear-cut as one would like, cf. the *Some backgrounds first* section of [this answer](http://stackoverflow.com/a/17190063/1729265), especially the remarks on "forms in-between" and "some imprecision". **B** As also explained in that answer, signed revisions can exactly be recognized. To extract them, you merely need to inspect the signed byte ranges entries and cut the original file accordingly – mkl Dec 03 '14 at 09:53
  • @mkl That post was quite helpful, as I gather from there and the answer below, it is the best to roll my own tool to do the extraction. – ted Dec 03 '14 at 11:57
  • *to roll my own tool* - when doing so, please be aware that there are some traps, some of them indicated in the answer referenced above but some also related to linearized and hybrid PDFs. – mkl Dec 03 '14 at 12:06
  • @mkl Would you mind enlightening me about hybrid PDFs? Also, I glanced at linearization in Appendix F and have the feeling that it will be uninteresting to me. If I understand correctly, a linearized document does not have incremental updates, thus I can just return it. If I have a document which has a linearized 'original' body and updates, I can just trim it down version by version until I (or rather poppler) detect a linearized document. Is this understanding of linearization sufficient/correct? – ted Dec 03 '14 at 13:53
  • *If I understand correctly a linearized document does not have incremental updates, thus I can just return it* - it does not but it is constructed similarly to documents with one incremental update. If you stop as soon as you see a correct linearized PDF, you should be ok. – mkl Dec 03 '14 at 16:57
  • *enlightening me about hybrid pdf's* - I meant hybrid-reference PDFs with both a cross reference table and a cross reference stream. I've seen funnier constructs for that than those in the PDF specification. I'd have to dig for them, though. – mkl Dec 03 '14 at 17:04

1 Answer

The technical term for the PDF feature you are talking about is 'incremental update'. So the question becomes: how can you discover whether a PDF document was incrementally updated and therefore contains different document versions?

Using a command line tool, pdfresurrect

There is a command line tool, pdfresurrect, which can do what you want. First, it can list the number of different versions contained in a PDF document. Example:

kp@mbp:> pdfresurrect -q incrupd.pdf
 incrupd.pdf: 2

Second, it can reveal a few more details about the changes between the versions:

kp@mbp:> pdfresurrect incrupd.pdf
 incrupd.pdf: --A-- Version 1 -- Object 0 (Stream)
 incrupd.pdf: --A-- Version 1 -- Object 1 (Catalog)
 incrupd.pdf: --A-- Version 1 -- Object 2 (Unknown)
 incrupd.pdf: --A-- Version 1 -- Object 3 (Pages)
 incrupd.pdf: --A-- Version 1 -- Object 4 (Page)
 incrupd.pdf: --A-- Version 1 -- Object 5 (Stream)
 incrupd.pdf: --A-- Version 1 -- Object 6 (ExtGState)
 incrupd.pdf: --A-- Version 1 -- Object 7 (Font)
 incrupd.pdf: --A-- Version 1 -- Object 8 (Unknown)
 incrupd.pdf: --A-- Version 1 -- Object 9 (Unknown)
 incrupd.pdf: --D-- Version 2 -- Object 0 (Stream)
 incrupd.pdf: --M-- Version 2 -- Object 5 (Stream)
 ---------- incrupd.pdf ----------
 Versions: 2
 Version 1 -- 10 objects
 Version 2 -- 2 objects

Third, it can write all versions to disk (creating a subdirectory in the current one), so you can inspect them one by one:

kp@mbp:> pdfresurrect -w incrupd.pdf

kp@mbp:> ls -ltr incrupd-versions/
 total 24
 -rw-r--r-- 1 kurtpfeifle staff 695 Dec 3 10:44 incrupd-versions.summary
 -rw-r--r-- 1 kurtpfeifle staff 3713 Dec 3 10:44 incrupd-version-2.pdf
 -rw-r--r-- 1 kurtpfeifle staff 3857 Dec 3 10:44 incrupd-version-1.pdf

Fourth, it can scrub the previous versions from the PDF document and keep only the latest:

kp@mbp:> pdfresurrect -s incrupd.pdf

kp@mbp:> ls -l incrupd*.pdf
 -rw-r--r--@ 1 kurtpfeifle  staff  3491 Dec  3 10:43 incrupd.pdf
 -rw-r--r--  1 kurtpfeifle  staff  3201 Dec  3 10:49 incrupd-scrubbed.pdf
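
If you want the version count from a program rather than from the shell, a thin wrapper around the `-q` call shown above may be enough. The following is only a sketch: it assumes pdfresurrect is installed and on the PATH, and that the `-q` output keeps the `<file>: <n>` shape seen in the example. The function names `parse_version_count` and `count_versions` are made up for illustration and are not part of pdfresurrect.

```python
import re
import subprocess


def parse_version_count(output: str) -> int:
    """Parse the '<file>: <n>' line printed by `pdfresurrect -q`.

    Assumption: the output ends with a colon followed by the count,
    as in the ' incrupd.pdf: 2' example above.
    """
    match = re.search(r":\s*(\d+)\s*$", output.strip())
    if match is None:
        raise ValueError(f"unexpected pdfresurrect output: {output!r}")
    return int(match.group(1))


def count_versions(pdf_path: str) -> int:
    """Run `pdfresurrect -q <pdf_path>` and return the reported version count."""
    result = subprocess.run(
        ["pdfresurrect", "-q", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return parse_version_count(result.stdout)
```

With the example file from above, `count_versions("incrupd.pdf")` would be expected to return the number that `pdfresurrect -q` prints for it.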

Using a text editor

If you know how to handle a text editor when it comes to (partially) binary files, you can also proceed like this:

  1. Backup your PDF.
  2. Open the backup PDF in the editor.
  3. Go to the end of the file.
  4. Search for the last occurrence of %%EOF. (In a well-behaved PDF, this should be right at the end, without any garbage following after.)
  5. Search for the last-but-one occurrence of %%EOF.
    • Delete everything after the last-but-one %%EOF up to the very end of the file.
    • Save the file under a new name (preferably containing -version2.pdf).

Congratulations -- you've just restored the previous version of the PDF document. :-)

Repeat the above procedure to restore even older versions...
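
The manual procedure above can also be scripted. The sketch below cuts the file after each %%EOF marker and writes one copy per cut. It is a heuristic, not a PDF parser: it assumes every %%EOF closes a revision, which can be wrong if the marker appears inside a stream, and a linearized PDF carries an extra %%EOF of its own, so the earliest cut may not be a usable document. The function names and the `-version-N.pdf` naming are hypothetical choices for this example.

```python
from pathlib import Path

EOF_MARKER = b"%%EOF"


def revision_end_offsets(data: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker (and its line ending)."""
    offsets = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        end = pos + len(EOF_MARKER)
        # Keep the line ending that follows the marker, if there is one.
        if data[end:end + 2] == b"\r\n":
            end += 2
        elif data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        offsets.append(end)
        pos = data.find(EOF_MARKER, pos + len(EOF_MARKER))
    return offsets


def extract_revisions(pdf_path: str, out_dir: str) -> list[Path]:
    """Write one truncated copy of the PDF per %%EOF marker found.

    The original file is only read, never modified.
    """
    source = Path(pdf_path)
    data = source.read_bytes()
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for number, end in enumerate(revision_end_offsets(data), start=1):
        target = target_dir / f"{source.stem}-version-{number}.pdf"
        target.write_bytes(data[:end])  # everything up to this revision's %%EOF
        written.append(target)
    return written
```

Calling `extract_revisions("incrupd.pdf", "incrupd-versions")` would write one file per marker found. Each output should still be opened in a viewer (or checked with poppler) before it is trusted as a real earlier version.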

Kurt Pfeifle