138

How can I inspect the structure of PDF files?

Use case: I'm trying to programmatically generate PDF files (using iText). I'm having trouble achieving certain layouts, but I have PDF files with text laid out the way I want (generated from Word). I would like to reverse engineer how they do it.

PDF Inspector seems to be good, but I'm looking for something for Windows.

Hans Brende
bmm6o
  • PDF Inspector is Java based, so multiplatform. – david.perez Apr 21 '17 at 10:50
  • 2
    Doesn't seem to run on Windows though. The jar doesn't do anything when clicked on. When called at the command line I get `no main manifest attribute, in PDF Document Inspector.jar` – Tom Apr 21 '17 at 18:29
  • 2
    @david.perez it's java based but apple wrapped so it's kinda apple only distribution. There is "PDF Document Inspector.app/Contents/Resources/Java/PDF Document Inspector.jar" jar but it's not startable as java -jar "PDF Document Inspector.jar" Also there is lot of com.apple.cocoa.* includes that are platform specific. :( – andrej Nov 13 '19 at 11:03
  • 1
    I'm now successfully using iText RUPS, which is multiplatform and Java based. – david.perez Nov 13 '19 at 11:12
  • 4
    Ugh, it's a bit tiring that people insist on closing tickets that say "best tool" when they really mean "what should I use to do X" or " – Att Righ Sep 30 '21 at 10:30
  • 2
    Unfortunately I can't add an answer since the question is closed, but after much searching I finally found this tool: https://brendandahl.github.io/pdf.js.utils/browser/ (using pdf.js under the hood to inspect the structure of your pdf). I've had a lot of success reverse engineering pdfs with this page. – Hans Brende Mar 09 '23 at 21:20

10 Answers

134

Besides the GUI-based tools mentioned in the other answers, there are a few command line tools which can transform the original PDF source code into a different representation that lets you inspect the (now modified) file with a text editor. All of the tools below work on Linux, Mac OS X, other Unix systems or Windows.

qpdf (my favorite)

Use qpdf to uncompress (most of) the objects' streams and also dissect ObjStm objects into individual indirect objects:

qpdf --qdf --object-streams=disable orig.pdf uncompressed-qpdf.pdf

qpdf describes itself as a tool that does "structural, content-preserving transformations on PDF files".

Then just open + inspect the uncompressed-qpdf.pdf file in your favorite text editor. Most of the previously compressed (and hence, binary) bytes will now be plain text.
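Once the file is in this plain-text form, a quick inventory of its indirect objects is a handy starting point before reading it in the editor. The following is a minimal sketch using only the Python standard library; `list_objects` is a hypothetical helper name, and the regular expressions assume that object headers start at the beginning of a line (as they do in qpdf's QDF output) and that no binary stream data happens to look like an object header.

```python
import re

# Start of an indirect object at the beginning of a line, e.g. b"12 0 obj".
OBJ_HEADER = re.compile(rb"^(\d+)\s+(\d+)\s+obj\b", re.MULTILINE)
# A /Type entry inside an object's dictionary, e.g. b"/Type /Page".
TYPE_ENTRY = re.compile(rb"/Type\s*/([A-Za-z0-9]+)")


def list_objects(pdf_bytes):
    """Return (object number, generation, /Type name or None) per object."""
    results = []
    headers = list(OBJ_HEADER.finditer(pdf_bytes))
    for i, header in enumerate(headers):
        # The body of an object runs up to the next header (or end of file).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(pdf_bytes)
        body = pdf_bytes[header.end():end]
        # Only search the part before any stream data for the /Type entry.
        dictionary = body.split(b"stream", 1)[0]
        match = TYPE_ENTRY.search(dictionary)
        obj_type = match.group(1).decode("ascii") if match else None
        results.append((int(header.group(1)), int(header.group(2)), obj_type))
    return results
```

Reading `uncompressed-qpdf.pdf` in binary mode and passing the bytes to `list_objects` gives a list you can scan for the `Page`, `Font` or `XObject` entries you want to look at more closely.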

mutool

There is also the mutool command line tool, which comes bundled with the MuPDF PDF viewer (a sister product to Ghostscript, made by the same company, Artifex). The following command also uncompresses streams and makes them easier to inspect in a text editor:

mutool clean -d orig.pdf uncompressed-mutool.pdf

podofouncompress

PoDoFo is a free software / open source library for working with the PDF format, and it includes a few command line tools, including podofouncompress. Use it like this to uncompress PDF streams:

podofouncompress orig.pdf uncompressed-podofo.pdf

peepdf.py

PeePDF is a Python-based tool which helps you to explore PDF files. Its original purpose was for research and dissection of PDF-based malware, but I find it useful also to investigate the structure of completely benign PDF files.

It can be used interactively to "browse" the objects and streams contained in a PDF.

I'll not give a usage example here, but only a link to its documentation:

pdfid.py and pdf-parser.py

pdfid.py and pdf-parser.py are two PDF tools by Didier Stevens written in Python.

Their background is also to help explore malicious PDFs -- but I also find them useful to analyze the structure and contents of benign PDF files.

Here is an example of how I would extract the uncompressed stream of PDF object no. 5 into a *.dump file:

pdf-parser.py -o 5 -f -d obj5.dump my.pdf
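To see roughly what such an extraction involves, here is a minimal standard-library sketch. It is not how pdf-parser.py is implemented; `extract_stream` is a hypothetical helper that assumes a classic (non-object-stream) object layout and handles only an unfiltered stream or a stream that uses FlateDecode. Any other filter, or a chain of several filters, is out of scope for this sketch.

```python
import re
import zlib


def extract_stream(pdf_bytes, obj_num, gen=0):
    """Return the decoded stream of one indirect object, or None if it has none."""
    # Locate the object header, e.g. b"5 0 obj" (not b"15 0 obj").
    header = re.search(rb"(?<!\d)%d\s+%d\s+obj\b" % (obj_num, gen), pdf_bytes)
    if header is None:
        return None
    # Whichever comes first decides whether this object carries a stream.
    marker = re.compile(rb"endobj|stream\r?\n").search(pdf_bytes, header.end())
    if marker is None or marker.group() == b"endobj":
        return None
    dictionary = pdf_bytes[header.end():marker.start()]
    data_end = pdf_bytes.find(b"endstream", marker.end())
    if data_end == -1:
        return None
    raw = pdf_bytes[marker.end():data_end]
    if b"/FlateDecode" in dictionary:
        # decompressobj tolerates the end-of-line bytes before "endstream".
        return zlib.decompressobj().decompress(raw)
    return raw
```

Writing the returned bytes to `obj5.dump` would give a file comparable in spirit to the one produced by the command above, at least for simple Flate-compressed content streams.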

Final notes

  1. Please note that some binary parts inside a PDF cannot necessarily be uncompressed (or decoded into human-readable ASCII), because they are embedded and used in their native format inside PDFs. Examples of such parts are JPEG images, fonts and ICC color profiles.

  2. If you compare the tools above and the command line examples given, you will discover that they do NOT all produce identical outputs. The effort of comparing their differences can in itself help you to better understand the nature of the PDF syntax and file format.
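To get a feel for which streams in a file fall into the first category, one rough approach is to list the `/Filter` entry of every stream object. The sketch below uses only the Python standard library; `stream_filters`, `is_binary_payload` and the `BINARY_FILTERS` set are names made up for this illustration, and the set is a selection of image filters rather than a complete list. It does not detect fonts or ICC profiles.

```python
import re

# Image filters whose decoded output is still binary image data.
# Illustrative selection only, not an exhaustive list.
BINARY_FILTERS = {"DCTDecode", "JBIG2Decode", "JPXDecode", "CCITTFaxDecode"}

OBJ_HEADER = re.compile(rb"^(\d+)\s+(\d+)\s+obj\b", re.MULTILINE)
FILTER_ENTRY = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")


def stream_filters(pdf_bytes):
    """Map object number -> list of filter names, for stream objects only."""
    report = {}
    headers = list(OBJ_HEADER.finditer(pdf_bytes))
    for i, header in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(pdf_bytes)
        body = pdf_bytes[header.end():end]
        dictionary, keyword, _ = body.partition(b"stream")
        if not keyword:
            continue  # this object has no stream
        entry = FILTER_ENTRY.search(dictionary)
        names = re.findall(rb"/([A-Za-z0-9]+)", entry.group(1)) if entry else []
        report[int(header.group(1))] = [n.decode("ascii") for n in names]
    return report


def is_binary_payload(filters):
    """True if any filter in the chain is one of the image filters above."""
    return any(name in BINARY_FILTERS for name in filters)
```

Streams flagged by `is_binary_payload` are the ones to dump to a separate file and open with an image tool instead of a text editor.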

Jeroen Wiert Pluimers
Kurt Pfeifle
  • Any idea how I can inspect a JBIG2 Stream? E.g. a Stream that uses Filter "/Jbig2decode"? They are sadly still unreadable using these methods – SirHawrk Jun 15 '22 at 08:15
  • For mutool I recommend adding `-c`, so `mutool clean -c -d orig.pdf uncompressed-mutool.pdf`, so that each instruction in the content stream will be on a separate line so it's easier to read. – user202729 Nov 19 '22 at 02:15
69

I use iText RUPS (Reading and Updating PDF Syntax) on Linux. Since it's written in Java, it works on Windows, too. You can browse all the objects in a PDF file in a tree structure. It can also decode Flate-encoded streams on the fly to make inspecting easier.

Here is a screenshot:

iText RUPS screenshot

gkcn
  • 9
    `java -jar itext-rups-5.5.6.jar` -> `Exception in thread "AWT-EventQueue-0" java.lang.NoClassDefFoundError: com/itextpdf/text/Version` - How are you supposed to run this thing? Edit: Figured it out. You should not download the default file offered by SourceForge, you need to download the .jar which includes dependencies. – Zero3 Jul 13 '15 at 00:52
  • 2
    @Zero3 just came across the same thing. Thanks for your comment. – Sam Jul 13 '15 at 06:33
  • @Zero3: You should [no longer download from SF at all](http://arstechnica.com/information-technology/2015/05/sourceforge-grabs-gimp-for-windows-account-wraps-installer-in-bundle-pushing-adware/)... – Kurt Pfeifle Sep 29 '15 at 11:12
  • 1
    @KurtPfeifle I completely agree. Unfortunately, a lot of software (like this!) is only available through SourceForge because the maintainer did not move the project elsewhere yet, and might never do so. You should indeed be very careful when downloading anything from SourceForge these days... – Zero3 Sep 29 '15 at 12:57
  • @Zero3 at the time you wrote that comment, all iText related software, including RUPS, was already on GitHub for more than 6 months. There is also the official iText website, http://itextpdf.com – Amedee Van Gasse Mar 11 '16 at 07:35
  • @Zero3 the release of iText 5.5.9 is scheduled for next week and might not be offered on Sourceforge. I will put up a notice to tell people where we have moved. Unfortunately that will make some other people unhappy, but you cannot please all of the people all of the time. – Amedee Van Gasse Mar 11 '16 at 07:38
  • @AmedeeVanGasse Great! I was not aware, as I just followed the link by gkcn. I'm not sure what you mean with making other people unhappy. Abandoning SourceForge seems like the only sensible thing to do. – Zero3 Mar 11 '16 at 10:44
  • There are tons of ancient links all over the web, also on StackOverflow, that point to Sourceforge. If they point to the main project page, then it's okay and they will see the notice that I will put up. But if it is a deep link to a specific file on a specific commit, and I remove that, then people will get a 404. – Amedee Van Gasse Mar 11 '16 at 11:00
  • @AmedeeVanGasse is iText RUPS available as a compiled jar ready to use by non-developers? – iPDFdev Apr 12 '16 at 09:50
  • 7
    Yes - as a compiled jar and even as an exe, for Windows users. See http://github.com/itext/rups/releases/latest – Amedee Van Gasse Apr 12 '16 at 09:53
  • 1
    @AmedeeVanGasse the screenshot in this answer shows a view of the page (between the document tree and xref tab). How can I display that view in v5.5.9 on Windows? – iPDFdev Apr 12 '16 at 13:03
  • Please start a new question. – Amedee Van Gasse Apr 12 '16 at 13:09
  • AGPL version has no built-in renderer... – Stepan Yakovenko Jun 19 '17 at 18:38
  • 1
    for all experiencing `Exception in thread "AWT-EventQueue-0"` issue try running other jar from zipfile: `java -jar itext-rups-5.5.9-jar-with-dependencies.jar` – ReDetection Jul 04 '17 at 04:18
  • 1
    I found [PikePDF](https://pikepdf.readthedocs.io/en/latest/) to be an excellent way to get at QPDF’s functionality from Python. – andrewdotn Feb 09 '21 at 12:49
  • If you get `java.lang.UnsatisfiedLinkError: Can't load library: /usr/lib/jvm/java-11-openjdk-amd64/lib/libawt_xawt.so`, try `sudo apt-get install openjdk-11-jre` – mlissner Nov 10 '21 at 01:00
  • Current RUPS version does even allow for editing the PDF structure right from the GUI. – Jaime Hablutzel Dec 26 '21 at 22:06
24

Adobe Acrobat has a very cool but rather well hidden mode allowing you to inspect PDF files. I wrote a blog article explaining it at https://blog.idrsolutions.com/2009/04/viewing-pdf-objects/

Markus Jarderot
mark stephens
  • This seems to require a plugin; at least it's not available in Acrobat Reader 9.5.5 on Linux. – Adam Spiers Dec 09 '14 at 22:44
  • 3
    @AdamSpiers, preflight dialog box is a feature of Adobe Acrobat, not Adobe Reader – IPSUS Mar 26 '15 at 13:20
  • ... and Acrobat ([formerly Acrobat Exchange](http://en.wikipedia.org/wiki/Adobe_Acrobat)) is not available for Linux :-/ – Adam Spiers Mar 26 '15 at 13:32
  • 10
    Preflight dialog box actually requires Adobe Acrobat Pro. It is not available in Adobe Acrobat Standard. – Futal Jun 26 '18 at 20:35
  • 1
    And it is a UI nightmare to actually use. – Jon Jan 07 '20 at 22:40
  • Well we do not use Adobe Acrobat - so how to inspect the PDF without it? – nfc1 Mar 19 '20 at 18:37
  • I know this is a very old thread, but I found an [online PDF inspector](https://pdfux.com/inspect-pdf/), which allows you to browse the PDF structure in a way very similar to how Adobe does it. It is _slightly_ less powerful than Adobe, but it's free and online, so might still be useful for somebody… – FurloSK Mar 11 '21 at 10:23
9

PDFXplorer from O2 Solutions does an outstanding job of displaying the internals if you're on a Windows machine.

http://www.o2sol.com/pdfxplorer/overview.htm

(Free, distracting banner at the bottom).

mlissner
Pierre
9

If you're on Windows, PDF Analyzer is similar to PDFXplorer, but it has more options. It is also free after a single registration.


mlissner
juFo
  • For me PDFXplorer works much better, because it goes deeper into the contents. – Daniel May 17 '21 at 05:09
  • @Daniel how do you mean, in the tree? I like the fact that PDFAnalyzer can show text and can dump images. – juFo May 18 '21 at 14:27
  • I compared PDFxplorer and PDF Analyzer and PDFXplorer lets me dig down a bit deeper into the internal structures of the streams than PDF Analyzer. – Daniel May 22 '21 at 01:23
  • For people reading this that want to try PDF Analyzer, you don't need to register into their site just fill the names and emails with anything and click "Register my free copy" but make sure to block the application from accessing Internet through your firewall, or disable Internet while registering the application. – churchill Jun 01 '21 at 18:34
8

There is also another option. Adobe Acrobat Pro is also able to display the internal tree structure of the PDF.

  1. Open Preflight
  2. Go to Options (right upper corner)
  3. Internal PDF Structure

On top of that, Adobe Acrobat Pro can also display the internal structure of the document fonts in the PDF. Most other "PDF tree structure viewers" don't have this option.


Vad1mo
  • 3
    This is what @mark-stephens describes in the accepted answer. – koppor Mar 06 '18 at 13:35
  • 6
    @mark-stephens' answer just links to a blog post that might disappear in the future (and is discouraged on SO). vadimo's actually provides the answer. – Starfish Dec 26 '18 at 18:24
5

I've used PDFBox with good success. Here's a sample of what the code looks like (back from version 0.7.2), that likely came from one of the provided examples:

// load the document
System.out.println("Reading document: " + filename);
PDDocument doc = PDDocument.load(filename);

// look at all the document information
PDDocumentInformation info = doc.getDocumentInformation();
COSDictionary dict = info.getDictionary();
List l = dict.keyList();
for (Object o : l) {
    //System.out.println(o.toString() + " " + dict.getString(o));
    System.out.println(o.toString());
}

// look at the document catalog
PDDocumentCatalog cat = doc.getDocumentCatalog();
System.out.println("Catalog:" + cat);

List<PDPage> lp = cat.getAllPages();
System.out.println("# Pages: " + lp.size());
// inspect the fifth page (index 4); this throws if the document has fewer pages
PDPage page = lp.get(4);
System.out.println("Page: " + page);
System.out.println("\tCropBox: " + page.getCropBox());
System.out.println("\tMediaBox: " + page.getMediaBox());
System.out.println("\tResources: " + page.getResources());
System.out.println("\tRotation: " + page.getRotation());
System.out.println("\tArtBox: " + page.getArtBox());
System.out.println("\tBleedBox: " + page.getBleedBox());
System.out.println("\tContents: " + page.getContents());
System.out.println("\tTrimBox: " + page.getTrimBox());
List<PDAnnotation> la = page.getAnnotations();
System.out.println("\t# Annotations: " + la.size());
Kaleb Pederson
4

The object viewer in Acrobat is good, but Windjack Solution has a plugin for Acrobat called PDF Canopener that allows better inspection, with an eyedropper for selecting objects on the page. It also permits modifications to be made to the PDF.

https://www.windjack.com/product/pdfcanopener/

mlissner
Dwight Kelly
1

If you want to work programmatically from within Python, pdfminer is a good option. It allows you to work with PDF structure in memory as an object hierarchy or serialize it as XML.

W.P. McNeill
-8

My suggestion is Foxit PDF Reader, which is very helpful for doing important text editing work on a PDF file.

nifCody
  • 8
    I couldn't find any way in Foxit Reader to view the internal structure of a PDF similar to PDF Inspector (referenced in the question) – bmaupin Feb 12 '17 at 20:48