Python - Open pdf file to specific page label

Question

I have read many answers that explain how to open a PDF file to a specific page; from this answer, the solution is something like this:

import subprocess
import os

# Raw strings keep the backslashes literal (otherwise '\t' in the path becomes a tab)
path_to_pdf = os.path.abspath(r'C:\test_file.pdf')
# I am testing this on my Windows Install machine
path_to_acrobat = os.path.abspath(r'C:\Program Files (x86)\Adobe\Reader 10.0\Reader\AcroRd32.exe')

# this will open your document on page 12
process = subprocess.Popen([path_to_acrobat, '/A', 'page=12', path_to_pdf], shell=False, stdout=subprocess.PIPE)
process.wait()

However, I have not found anywhere how to open a PDF to a specific page label (which is a problem when the page number does not match the page label).
How can I do that?

PS: In case it is not clear, in this image you see the difference between page number (274) and page label (208).

Does this have to be for a specific PDF, or do you want this to be a generic solution? If it is for a specific one, then all you'd have to do is add or subtract a fixed amount from the page number you want to navigate to, but since you're posting here it seems like you want this to work on multiple PDFs with different offsets. — Random Davis, Jul 06 '21 at 15:27
It would be better if it were a generic solution. However, it might also help to know just how to get the page _label_ associated to a certain page _number_ (in order to obtain that "fixed amount") — Plato, Jul 06 '21 at 15:56

score 0 · Answer 1 · answered Feb 26 '23 at 14:14

pypdf can read the page labels:

from typing import Dict

from pypdf import PdfReader

reader = PdfReader("example.pdf")
labels = reader.page_labels

index2label: Dict[int, str] = {
    index: label for index, label in enumerate(labels)
}

This way you can find the index you're interested in. After that, you can just work with the index.
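As a minimal sketch of "working with the index", the label list can be searched for the wanted label and the result passed to the command from the question. The function names here are hypothetical, and the sketch assumes that Acrobat's page= parameter counts pages from 1 (so zero-based index + 1) and that the first page carrying a label is the one wanted.

```python
import subprocess
from typing import List, Sequence


def read_page_labels(pdf_path: str) -> List[str]:
    """Return one label per page, in page order, using pypdf."""
    # Imported lazily so the helpers below can be used without pypdf installed.
    from pypdf import PdfReader

    return list(PdfReader(pdf_path).page_labels)


def page_number_for_label(labels: Sequence[str], label: str) -> int:
    """Return the 1-based page number of the first page carrying `label`."""
    try:
        # list.index is zero-based; the viewer parameter is assumed to be 1-based.
        return list(labels).index(label) + 1
    except ValueError:
        raise KeyError(f"no page is labelled {label!r}") from None


def acrobat_command(path_to_acrobat: str, path_to_pdf: str, page_number: int) -> List[str]:
    """Build the same argument list as in the question for a 1-based page number."""
    return [path_to_acrobat, "/A", f"page={page_number}", path_to_pdf]


def open_at_label(path_to_acrobat: str, path_to_pdf: str, label: str) -> subprocess.Popen:
    """Look the label up in the PDF, then launch the viewer on that page."""
    labels = read_page_labels(path_to_pdf)
    page_number = page_number_for_label(labels, label)
    return subprocess.Popen(acrobat_command(path_to_acrobat, path_to_pdf, page_number))
```

With the paths from the question, `open_at_label(path_to_acrobat, path_to_pdf, "208")` would then be the call for the page labelled 208.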

K J · Answer 2 · 2023-02-26T16:48:17.217

Page numbers in a PDF are zero-based, so the first page is usually extracted programmatically by starting the counter at i = [0]; page[1] is then Page 2, and so on. The number we see in a page [X of N] is therefore itself a label, by human convention.

However, for journals or books we may not want to count the cover [0], and may instead use "custom label ranges" like cover, iv, iii, ii, i, 1, 2 or A, B, C or even i, ii, iii, etc. Or, for an extract from a larger annual collection, page 1 may be labelled 1406.

The labels usually follow a sequence that is logical to humans but alien to the way PDFs are built, so we simply show the custom tag as [A] 5/9, where A is really page [4] of 9.

Or, in this case, it's 1/1 (in truth it's really [0]/[0]).

To extract that sequence, look for a catalog entry like

/PageLabels<</Nums[0<</S/A>>]>>/Pages 5 0 R/Type/Catalog

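So we can see the labels are stored under /Nums, as pairs of a starting page index and a label dictionary. Here the only range starts at page [0] and uses the style <</S/A>>, i.e. uppercase letters, so page [0] is labelled A.

So it's easy to write a # output if you know where the /Nums have been stored.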

Returning to your query: you can add named destinations, and for that page 1 (labelled A) we can search for an entry like

%Prepare Named Destinations
3 0 obj
<</Names [(Page1) [6 0 R /XYZ 0 792 null] (QRCode) [6 0 R /XYZ 25.0 317.0 1]]>>

So to open at page A in this case we can say #NamedDest=Page1 or #NamedDest=QRCode. Both external names point to 6 0 R, which is object 6, i.e. Page [0] (labelled A in this case), and they do not have to match any internal label.

BEWARE: those named destinations are a single ID string WITHOUT spaces (a URL should not contain spaces) and are totally different from the internal /D destinations you see in a TOC or outline, where the page label can be Page A or introduction or any other navigation tag/keyword.

Many users are confused by why calling a Name externally does not work when they see /D (Destination) like this

9 0 obj
<</A<</D [6 0 R /FitR 90 90 300 300]/S/GoTo>>/C[0 0 1]/F 3/Parent 4 0 R/Prev 7 0 R/Title(Hello World\041)>>
endobj

Here the destination is the QRCode destination, but for human consumption it is listed in the navigator as page Hello World. Some readers can jump via A, Hello World, QRCode or page = 1, using a different external switch prefix (-, --, / or #).

TL;DR

So it is not possible for Acrobat to jump to a label

/A page=A

it can go to inputs

/A nameddest=destination e.g. QRCode
/A page=pagenum e.g. 1
/A #page=1&comment=commentID
/A search=wordList
/A #page=1&highlight=lt,rt,top,btm

but not outputs

/A CustomLabel=A page by any other name

However, other readers may. So, as suggested for Python, you first need to open the file, decompress it and search for the tagging, but that defeats your whole aim of opening the file directly at the given page label!

Thus, rather than building a look-up list after opening the file, you need an external look-up list or, better yet, convert the internal labels to external named destinations if they are not already there (they rarely are).
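One possible shape for such an external look-up list is a small JSON file stored next to the PDF, built once and consulted before the viewer is launched. This is only a sketch: the function names and the JSON layout are invented for illustration, the `labels` list is assumed to come from something like pypdf's `reader.page_labels`, and page numbers are stored 1-based on the assumption that this is what the viewer's page= parameter expects.

```python
import json
from typing import Dict, Sequence


def build_lookup(labels: Sequence[str]) -> Dict[str, int]:
    """Map each label to the 1-based number of the first page that carries it."""
    lookup: Dict[str, int] = {}
    for index, label in enumerate(labels):
        # setdefault keeps the first occurrence if a label is repeated.
        lookup.setdefault(label, index + 1)
    return lookup


def save_lookup(lookup: Dict[str, int], json_path: str) -> None:
    """Write the look-up list to a JSON file kept alongside the PDF."""
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(lookup, handle, indent=2)


def load_lookup(json_path: str) -> Dict[str, int]:
    """Read the look-up list back without touching the PDF itself."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)
```

Once the file exists, `load_lookup("example.labels.json")["208"]` (a hypothetical file name) returns the page number to pass as page=..., so the PDF does not need to be parsed again at open time.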
