12

I'd like to export the page-labels stored in some PDF documents for easy parsing. I know I could dig into the PDF document after having it converted with qpdf, but this seems like overkill.

Is there no command-line tool that will simply print the page label for each page (alone or together with other meta-data)? I know that PDFSpy will export the labels, but $300 isn't an option; preferably, the solution should be free.

Kurt Pfeifle
grovel

2 Answers

15

Short answer:
I am not aware of any (free) tool that can 'simply print' the page label for each page.

Also, you won't be able to avoid expanding compressed objects and object streams with a tool like qpdf or one with equivalent capabilities.

Long answer:
There's no such tool because there are only a few things you can safely rely on when it comes to page labels:

  1. Each PDF document must contain a root object.
  2. That root object must be of /Type /Catalog.
  3. The document's trailer shows where to find that object: its /Root key is followed by an indirect reference to the object.
  4. IF a PDF document uses non-standard page labels, then the document root object must have an entry named /PageLabels.
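To illustrate point 3, here is a minimal Python sketch that looks up the /Root reference in a classic, uncompressed trailer. The function name `find_root_reference` is made up for this example. It assumes the file ends with a plain `trailer` dictionary; files that use cross-reference streams instead have no `trailer` keyword, and the sketch simply returns `None` for them.

```python
import re

def find_root_reference(pdf_bytes: bytes):
    """Return (object number, generation) of /Root from the last classic trailer.

    Assumption: the trailer is stored as plain text after the keyword 'trailer'.
    Returns None if no such trailer or no /Root entry is found.
    """
    idx = pdf_bytes.rfind(b"trailer")  # the last trailer is the current one
    if idx == -1:
        return None
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", pdf_bytes[idx:])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Hand-written fragment for demonstration only, not a complete PDF file.
sample = b"trailer\n<< /Size 300 /Root 213 0 R >>\nstartxref\n12345\n%%EOF\n"
print(find_root_reference(sample))
```

The returned pair only tells you which object to look for; that object may still sit inside a compressed object stream.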

This is where it stops being relatively easy, because the object the /PageLabels key refers to may be contained in a compressed object stream. This means that you'd have to expand that object stream.

If you do succeed in getting the description of the page labels as ASCII, you'll discover that it's not an easily parseable flat list (like a dictionary is): it is a number tree.

I won't go into the details of these complexities, because it would take a very long article to describe all possible variations. You'd better read up on it directly in the official ISO PDF-1.7 specification.

Instead, I'll give you an example in ASCII PDF code:

213 0 obj
  << /Type /Catalog
     /PageLabels 
        << 
           /Nums 
                 [ 
                   0 <<           % start labeling from page no. 1
                       /S /r      % label with lowercase roman numbers
                     >> 
                   7 <<           % start new labeling from page no. 8
                       /S /D      % label with standard decimal numbers
                     >> 
                   11 <<          % start labeling page no. 12
                       /S /D      % label with decimal numbers...
                       /P (ABCD-) %   ...but using label prefix 'ABCD-'...
                       /St 3      %   ...followed by '3' as the start decimal.
                     >>
                  ]
        >>
     %%...........................
     %%...more root object keys...
     %%........................... 
  >>
endobj

The above example will label the pages number 1, 2, 3, ... (last) like this:

i
ii
iii
iv
v
vi
vii
1
2
3
4
ABCD-3
ABCD-4
ABCD-5
ABCD-6
...and so on until last page...

As you can see, the PDF method of labeling pages (mapping page numbers to page names) is completely non-intuitive. You can only understand it by studying the PDF specification.
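For what it's worth, once the /Nums array has been extracted, the mapping itself can be sketched in a few lines of Python. This is only an illustration under assumptions: the number tree has already been flattened into one sorted list of (zero-based page index, label dictionary) pairs starting at index 0, and the function names and the plain-dict representation are invented for this example. Reading the array out of the PDF is not covered.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to uppercase roman numerals."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def to_letters(n: int) -> str:
    """Letter style: A..Z, then AA..ZZ, then AAA..ZZZ, and so on."""
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)

def format_number(style, n: int) -> str:
    """Render n according to the /S entry of a label dictionary."""
    if style == "D":
        return str(n)
    if style == "R":
        return to_roman(n)
    if style == "r":
        return to_roman(n).lower()
    if style == "A":
        return to_letters(n)
    if style == "a":
        return to_letters(n).lower()
    return ""  # no /S entry: the label consists of the prefix only

def expand_page_labels(ranges, page_count: int):
    """Expand (start_index, spec) pairs into one label per page.

    Assumption: ranges is sorted and its first start_index is 0.
    """
    labels = []
    for i, (start, spec) in enumerate(ranges):
        end = ranges[i + 1][0] if i + 1 < len(ranges) else page_count
        first = spec.get("St", 1)  # /St defaults to 1
        for offset in range(min(end, page_count) - start):
            labels.append(spec.get("P", "") + format_number(spec.get("S"), first + offset))
    return labels

# The /Nums array from the example above, for a 14-page document.
ranges = [(0, {"S": "r"}), (7, {"S": "D"}), (11, {"S": "D", "P": "ABCD-", "St": 3})]
for page, label in enumerate(expand_page_labels(ranges, 14), start=1):
    print(page, label)
```

The hard part remains getting at the /Nums array in the first place; the arithmetic afterwards is the easy bit.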

Andrew
Kurt Pfeifle
  • Thanks for this excellent summary of the situation. I'd found out about most of this before. I realized that it would either be my own mini-parser, or someone else had written it before (which I was hoping). I would be happy to calculate them myself from the information in the root-object, but unfortunately, the root-object is not always easy to find in a simple jscript implementation (which I wanted to use). QPDF easily gives me access to the page-objects, but there's no way of asking it to return the trailer or the root-object directly, hence no way to know where to look for the Catalog – grovel Oct 17 '12 at 07:55
  • 1
    Ok, after further digging, I've actually found a rather simple solution: PDFtk (which I had looked at before, but this feature is poorly documented). – grovel Oct 17 '12 at 08:36
  • 11
    `pdftk.exe document.pdf dump_data output report.txt` will result in a txt-file which lists not only meta-data such as bookmarks, but also the page labels. It will look like this:

        PageLabelNewIndex: 1
        PageLabelStart: 1
        PageLabelPrefix: C
        PageLabelNumStyle: DecimalArabicNumerals
        PageLabelNewIndex: 3
        PageLabelStart: 1
        PageLabelNumStyle: LowercaseRomanNumerals
        PageLabelNewIndex: 15
        PageLabelStart: 1
        PageLabelNumStyle: DecimalArabicNumerals

    i.e. C1,C2,i,ii,...,xii,1,2,... Easy to parse, exactly what I need. @Kurt, thanks anyway, much appreciated! – grovel Oct 17 '12 at 08:47
  • 1
    @grovel: Oooh yesssss, good-ol' pdftk! Now I remember. Yes, I have even used pdftk myself for this some years ago. At the time, however, it wasn't working reliably for PageLabel info; maybe that's why I forgot about it again. Good on you to have re-discovered this feature for me. Will test it too. – Kurt Pfeifle Oct 17 '12 at 18:39
  • @grovel, this comment about pdftk deserves to be a separate answer :). – Sasha Dec 20 '20 at 10:24
  • @grovel, but this doesn't work for me :(. Neither in pdftk 2.02-4 nor in pdftk-java 3.0.9-1. These lines seem to be just ignored by `update_info_utf8` and are never produced by `dump_data_utf8`. – Sasha Dec 20 '20 at 11:44
  • 1
    @Sasha, you may want to check out my new answer below. Like you, I've found that pdftk doesn't always do the job. – mheim Mar 08 '21 at 16:25
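For readers who do get page-label records out of `pdftk ... dump_data` as quoted in the comments above, here is a hedged Python sketch of turning them into one label per page. It assumes the keys appear exactly as quoted, that `PageLabelNewIndex` is a 1-based page number, and that the style names are spelled `DecimalArabicNumerals` and `LowercaseRomanNumerals`; any other style is left unhandled. The function names are invented for this example, and the total page count has to be supplied separately.

```python
def to_lower_roman(n: int) -> str:
    """Convert a positive integer to lowercase roman numerals."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def parse_page_label_records(dump_text: str):
    """Collect PageLabel* records; each PageLabelNewIndex line starts a new one."""
    records = []
    for line in dump_text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key.startswith("PageLabel"):
            continue  # bookmarks, info fields and other lines are ignored
        if key == "PageLabelNewIndex":
            records.append({"index": int(value), "start": 1, "prefix": "", "style": None})
        elif records and key == "PageLabelStart":
            records[-1]["start"] = int(value)
        elif records and key == "PageLabelPrefix":
            records[-1]["prefix"] = value
        elif records and key == "PageLabelNumStyle":
            records[-1]["style"] = value
    return records

def expand_records(records, page_count: int):
    """Return the labels for pages 1..page_count covered by the records."""
    labels = []
    for i, rec in enumerate(records):
        end = records[i + 1]["index"] if i + 1 < len(records) else page_count + 1
        for page in range(rec["index"], min(end, page_count + 1)):
            n = rec["start"] + page - rec["index"]
            if rec["style"] == "DecimalArabicNumerals":
                number = str(n)
            elif rec["style"] == "LowercaseRomanNumerals":
                number = to_lower_roman(n)
            else:
                number = ""  # styles not quoted above are not handled in this sketch
            labels.append(rec["prefix"] + number)
    return labels

dump = """PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelPrefix: C
PageLabelNumStyle: DecimalArabicNumerals
PageLabelNewIndex: 3
PageLabelStart: 1
PageLabelNumStyle: LowercaseRomanNumerals
PageLabelNewIndex: 15
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals
"""
print(expand_records(parse_page_label_records(dump), 16))
```

As the later comments point out, pdftk does not always emit these records, so this only helps when the lines are actually present in the dump.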
3

I've written a small command-line utility based on Poppler that does just this task: https://github.com/HeimMatthias/pdfpagelabels

Disclaimer: I'm the OP and created the original post under a different account. I used the pdftk solution (listed in a comment above) successfully for years in my implementation. However, last year it was time to reimplement our system from scratch, and we had numerous instances where the pdftk output could not be parsed by our implementation.

The new command-line tool follows the philosophy of doing just one thing, but doing it well: it simply prints the page labels of all or selected pages of a PDF file. If anyone stumbles upon it here and finds it useful, all the better.

mheim