23

Most PDF files found on the Web have compressed and unreadable data streams. Is it possible to uncompress the internal content of a PDF file using Acrobat or Acrobat Distiller, so that the source code can be read in a text editor?

P.S. This question is inspired by this answer, which explains how it can be done with Ghostscript.

Alexey Popkov
  • What do you want to read in the editor? The operators used to draw something? Or also the text? – mkl Sep 16 '13 at 04:45
  • @mkl I want to read the operators used to draw vector figures. – Alexey Popkov Sep 16 '13 at 04:52
  • While I don't see how to do that using Acrobat (I only have version 9.5 at my hands, though), it is fairly easy to do that in a small Java or .Net program using iText or iTextSharp by reading a PDF and re-saving it without compression, cf. the method `decompressPdf` in [HelloWorldCompression.java](http://itextpdf.com/examples/iia.php?id=218) / [HelloWorldCompression.cs](http://kuujinbo.info/iTextInAction2Ed/index.aspx?ch=Chapter12&ex=HelloWorldCompression). – mkl Sep 16 '13 at 08:31

3 Answers

27

qpdf and pdftk have already been mentioned. To show the commands:

$ qpdf --qdf --object-streams=disable orig.pdf uncompressed-orig.pdf
$ pdftk orig.pdf output uncompressed-orig.pdf uncompress

mutool, however, hasn't been mentioned yet:

$ mutool clean -d -a orig.pdf uncompressed-orig.pdf

mutool is a command-line tool that ships with MuPDF, the lightweight PDF and document viewer.
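
As a rough sketch, the three commands above can be wrapped in a small Python script that runs whichever tool happens to be installed. The helper names `build_command` and `uncompress_pdf` are hypothetical; only the command lines themselves come from this answer.

```python
import shutil
import subprocess

# Command templates copied from the answer above.
# {src} and {dst} are placeholders for the input and output file names.
COMMANDS = {
    "qpdf": ["qpdf", "--qdf", "--object-streams=disable", "{src}", "{dst}"],
    "pdftk": ["pdftk", "{src}", "output", "{dst}", "uncompress"],
    "mutool": ["mutool", "clean", "-d", "-a", "{src}", "{dst}"],
}


def build_command(tool, src, dst):
    """Return the argument list for one tool (hypothetical helper)."""
    return [arg.format(src=src, dst=dst) for arg in COMMANDS[tool]]


def uncompress_pdf(src, dst):
    """Run the first tool found on PATH; return its name, or None if none is installed."""
    for tool in COMMANDS:
        if shutil.which(tool):
            # check=True raises CalledProcessError on any non-zero exit status.
            subprocess.run(build_command(tool, src, dst), check=True)
            return tool
    return None
```

Calling `uncompress_pdf("orig.pdf", "uncompressed-orig.pdf")` would then produce the same output file as typing the matching command by hand. Note that qpdf reports "succeeded with warnings" through a non-zero exit status, which `check=True` treats as a failure.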

I do not think you can uncompress the streams of PDF objects with Acrobat or Distiller (unless you have additional payware plugins available).

Thomas W
Kurt Pfeifle
  • Are you sure that for `qpdf` the option `--object-streams=disable` is a good choice? According to the [documentation](http://qpdf.sourceforge.net/files/qpdf-manual.html#ref.advanced-transformation) this option means "don't write any object streams." Will not the streams be erased as a result? – Alexey Popkov May 07 '15 at 16:56
  • @AlexeyPopkov: Yes, I'm pretty sure it is a good choice for the purpose. I'm using it daily. ***IF*** object streams are enabled, a lot of the smaller objects will be embedded into another object's stream, which makes it more complex to analyse, even if un-compressed. If you don't believe me, try it yourself. (You need an input file that has at least 1 object of `/Type /ObjStm`). Disabling object-streams will unpack all these streamed objects and put them properly into their own indirect objects again, individually. – Kurt Pfeifle May 07 '15 at 17:08
  • Do you mean that for `qpdf` the seemingly obvious choice `--stream-data=uncompress` will change the structure of the file and complicate it? – Alexey Popkov May 07 '15 at 17:15
  • @AlexeyPopkov: The `--qdf` mode already implies `--stream-data=uncompress`. And yes, using QPDF does change the structure of the file in some way. But it tries to do so in a content-preserving way. The self-description of QPDF even says so, stating that it is a *"CLI tool that does structural, content-preserving transformations on PDF files"*. (In which cases the contents change in an unwanted and unexpected way is a different matter. I've filed a few bug reports/enhancement requests about these: for example, OCGs ("layers") get flattened and incremental update history gets lost.) – Kurt Pfeifle May 07 '15 at 17:36
  • From the QPDF documentation it looks like the `--qdf` mode creates a very special version of the PDF file that is *editable*, which the developers of PDF did not intend, and for this reason the `--qdf` mode can be expected to corrupt the original file in some way. I appreciate this effort, but I'm still unsure whether the `--qdf` mode gives any benefit for the *readability* of the PDF code (in this thread I'm not interested in *editability*). – Alexey Popkov May 07 '15 at 18:06
  • @AlexeyPopkov: It's good you read the documentation *before* starting to use QPDF; I did the same, back in the day. Feel free to do whatever you want. I'm just sharing my knowledge + experience here. I hope you'll do the same once you have learned + know more (or other) things about PDFs + related tools than I do. Whatever tool you finally choose for making PDF code readable: you have to compare each of them against the others first. I really hope you'll put up a writeup somewhere on the 'Net describing + weighing the advantages as well as the disadvantages of each tool. I'd be your first reader !! – Kurt Pfeifle May 07 '15 at 18:44
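
To check whether an input file has any objects of `/Type /ObjStm`, as suggested in the comments above, a byte-level search is usually enough. This is only a heuristic sketch: the function name `count_object_streams` is hypothetical, and the pattern can give a false positive if the same text happens to appear inside a string or an uncompressed stream.

```python
import re

# Matches "/Type /ObjStm" with optional whitespace between the two names.
OBJSTM = re.compile(rb"/Type\s*/ObjStm\b")


def count_object_streams(pdf_bytes):
    """Count dictionary entries that declare an object stream (heuristic)."""
    return len(OBJSTM.findall(pdf_bytes))
```

Reading a file with `open(path, "rb").read()` and passing the result to `count_object_streams` before and after running `qpdf --qdf --object-streams=disable` should show the count dropping to zero, if the tool behaves as described above.
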
18

Use cpdf:

cpdf -decompress in.pdf -o out.pdf

and then the graphic operators for each page can be read in a text editor. You'll need a copy of the standard as a reference, though.

Disclosure: I am the author of cpdf.
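
To illustrate what such a decompression step does, here is a minimal Python sketch that inflates Flate-compressed streams using only the standard library. It is not how cpdf works internally; the function name `inflate_streams` is hypothetical, and a real parser would use each stream's `/Length` and `/Filter` entries instead of a regular expression.

```python
import re
import zlib

# Everything between the "stream" keyword and the next "endstream".
# The lookbehind keeps the pattern from matching inside "endstream".
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the inflated bytes of every stream that zlib can decode."""
    for match in STREAM.finditer(pdf_bytes):
        try:
            # decompressobj ignores the end-of-line bytes that follow the data.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or encoded with another filter
```

For a page content stream, the yielded bytes are the plain-text graphic operators (for example `0 0 m 100 100 l S`) that the answer refers to.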

Cody Piersall
johnwhitington
7

This is easy with qpdf and pdftk.

With Adobe Acrobat you can get at the internal structure after profiling a PDF: run Preflight with some profile (e.g. detect PDF syntax errors), then choose Options->Internal PDF structure. But there's no way to get something editable with a text editor.

Martin Schröder
  • I need to convert a PDF into something **readable** with a text editor. Is it possible with Acrobat? – Alexey Popkov Sep 16 '13 at 04:41
  • @AlexeyPopkov: You can export into e.g. XML. But _editable_: no. – Martin Schröder Sep 16 '13 at 06:27
  • Exporting to XML gives a result similar to exporting to TXT: only textual elements are included. I need to read the operators used to draw vector figures in the PDF. – Alexey Popkov Sep 16 '13 at 06:40
  • +1 Thanks for `Options->Internal PDF structure` in Preflight. It would be ideal to copy its content to a text editor for further investigation. BTW, there is no need for profiling to see `Internal PDF structure`: it works from the start (at least in Acrobat 11). – Alexey Popkov Sep 16 '13 at 10:42
  • @AlexeyPopkov: *" I need to read the operators used to draw vector figures in the PDF"*. In that case look for uncompressed `/Contents` objects and their streams. Inside the expanded streams, also look for `/name Do` operations: these may point to XObjects named `/name` containing vector elements (as well as point to raster image objects). – Kurt Pfeifle May 07 '15 at 18:46
  • For a given *.pdf file, Acrobat Pro DC provides an Export To function that offers a variety of alternative formats, one of which is PostScript. PostScript is the only likely option that would provide the operators. However, I haven't used PostScript, except as a stand-alone language, since shortly after it was first invented. A quick glance at the output for one page shows that the export produces ASCII-readable PostScript. If one can simulate/interpret operators such as "pop", "{get exec}bdf" etc., this might be as close as you get to the code generating vector or raster graphics. – Stuart Poss Feb 15 '19 at 14:49
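
Once a content stream is uncompressed, the `/name Do` operations mentioned in the comments above can be listed with a short Python sketch. In the PDF specification, `Do` invokes a named external object (XObject), which may be a form containing vector drawing or a raster image. The function name `xobject_names` is hypothetical, and the regular expression is a heuristic, not a full content-stream parser; for instance, it would also match the same text inside a string literal.

```python
import re

# A PDF name ("/" followed by regular characters), whitespace, then the Do operator.
DO_OPERATOR = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def xobject_names(content_stream):
    """Return the names invoked with Do, in order of appearance."""
    return [name.decode("latin-1") for name in DO_OPERATOR.findall(content_stream)]
```

Each returned name can then be looked up in the `/XObject` entry of the page's `/Resources` dictionary to find the object that holds the actual drawing operators or image data.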