Retrieve Custom page labels from document with pyPdf

Question

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page number to determine the order it should go in (e.g. if someone split up a book into 20 10-page PDFs and I want to put them back together).

I have two questions - 1.) I know that sometimes the page number is stored in the document data somewhere, as I've seen PDFs that render on Adobe as something like [1243] (10 of 150), but I've read documents of this sort into PyPDF2 and I can't find any information indicating the page number - where is this stored?

2.) If avenue #1 isn't available, I think I could iterate through the objects on a given page to try to find a page number - likely it would be its own object that has a single number in it. However, I can't seem to find any clear way to determine the contents of objects. If I run:

reader.pages[0].getContents()

This usually either returns:

{'/Filter': '/FlateDecode'}

or it returns a list of IndirectObject(num, num) objects. I don't really know what to do with either of these and there's no real documentation on it as far as I can tell. Is anyone familiar with this kind of thing that could point me in the right direction?

score 78 · Answer 1 · edited Dec 28 '22 at 22:46


The following worked for me:

from pypdf import PdfReader

reader = PdfReader("path/to/file.pdf")
len(reader.pages)

edited Dec 28 '22 at 22:46 by Martin Thoma

answered Jul 29 '13 at 18:14 by Josh

1

I had to change `pypdf` to `pyPdf` and the read type to `rb`. – Matthew Wesly Nov 05 '13 at 19:57
11

I also just noticed that this doesn't really answer the question he was asking, but it happened to be what I was looking for. (The number of pages in a pdf) – Matthew Wesly Nov 05 '13 at 20:10
2

For Python 3, I had to use the package `PyPDF2` instead. (`from PyPDF2 import PdfFileReader`) – Garrett Jan 04 '17 at 13:38

score 13 · Answer 2 · answered Nov 07 '17 at 23:55


The other answers use PyPDF/PyPDF2 which seems to read the entire file. This takes a long time for large files.

In the meantime I wrote something quick and dirty which doesn't take nearly as long to run. It does a shell call but I wasn't aware of any other way to do it. It can get the number of pages for pdfs that are ~5000 pages very quickly.

It works by calling the "pdfinfo" shell command, so it probably only works on Linux. I've only tested it on Ubuntu so far.

One strange behavior I've seen is that surrounding this in a try/except block doesn't catch errors; you have to catch subprocess.CalledProcessError explicitly.

from subprocess import check_output


def get_num_pages(pdf_path):
    # pdfinfo prints one "Key: value" line per field; keep the "Pages:" line
    output = check_output(["pdfinfo", pdf_path]).decode()
    pages_line = [line for line in output.splitlines() if "Pages:" in line][0]
    num_pages = int(pages_line.split(":")[1])
    return num_pages
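On the try/except remark: check_output raises subprocess.CalledProcessError when the command exits with a non-zero status, and FileNotFoundError when the executable is not on the PATH. A minimal sketch that handles both could look like the following. The names parse_page_count and get_num_pages_safe are made up for this example, and returning None on failure is just one possible choice.

```python
import subprocess


def parse_page_count(pdfinfo_output):
    """Return the integer after 'Pages:' in pdfinfo's text output, or None."""
    for line in pdfinfo_output.splitlines():
        if line.startswith("Pages:"):
            return int(line.split(":", 1)[1])
    return None


def get_num_pages_safe(pdf_path):
    """Like get_num_pages, but returns None instead of raising on failure."""
    try:
        raw = subprocess.check_output(["pdfinfo", pdf_path])
    except FileNotFoundError:
        return None  # the pdfinfo executable is not installed / not on PATH
    except subprocess.CalledProcessError:
        return None  # pdfinfo ran but exited with a non-zero status
    # Metadata is not guaranteed to be valid UTF-8, so decode leniently
    return parse_page_count(raw.decode(errors="replace"))
```

Keeping the parsing in its own function means it can be checked against a saved pdfinfo output without running the command.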

answered Nov 07 '17 at 23:55 by Bryant Kou

Just realized that the question was specifically for pypdf, but this is the first result when googling for how to get number of pages in a pdf using python, so this answer will be relevant for most. – Bryant Kou Nov 08 '17 at 01:26
+1 since this is still useful for people who just want to get the number of pages, are already using poppler-utils and don't want to add another dependency in their project. – Aristu Feb 08 '20 at 13:52
There are pre-compiled binaries for windows, too: [xpdf command line tools](http://www.xpdfreader.com/download.html) – m01010011 May 17 '20 at 01:15
Also, if you have `PyYAML` already installed you can use it to parse the data: `yaml.safe_load(subprocess.check_output(["pdfinfo", pdf_path]))['Pages']`. – m01010011 May 17 '20 at 01:22

kindall · Accepted Answer · 2012-09-11T21:53:19.607

For full documentation, see Adobe's 978-page PDF Reference. :-)

More specifically, the PDF file contains metadata that indicates how the PDF's physical pages are mapped to logical page numbers and how page numbers should be formatted. This is where you go for canonical results. Example 2 of this page shows how this looks in the PDF markup. You'll have to fish that out, parse it, and perform a mapping yourself.

In PyPDF, to get at this information, try, as a starting point:

pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
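Once that /Nums array has been read, the mapping step could be sketched roughly as below. This is not PyPDF code: it assumes the array has already been converted to plain Python data (a flat list of start index, label dictionary, start index, label dictionary, ...) with any indirect objects resolved. The helper names to_roman, to_letters and labels_from_nums are invented for the example, and the fallback for pages that no range covers is my own choice.

```python
def to_roman(n):
    """Lowercase roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, numeral in pairs:
        while n >= value:
            out += numeral
            n -= value
    return out


def to_letters(n):
    """1 -> a, 26 -> z, 27 -> aa, 28 -> bb (the letter is repeated)."""
    return "abcdefghijklmnopqrstuvwxyz"[(n - 1) % 26] * ((n - 1) // 26 + 1)


def labels_from_nums(nums, page_count):
    """Expand a flat [start_index, label_dict, ...] list into one label per page."""
    ranges = sorted(zip(nums[0::2], nums[1::2]), key=lambda pair: pair[0])
    labels = []
    for page_index in range(page_count):
        # The last range that starts at or before this page applies to it
        covering = [r for r in ranges if r[0] <= page_index]
        if not covering:
            labels.append(str(page_index + 1))  # assumed fallback: plain 1-based number
            continue
        start_index, label_dict = covering[-1]
        number = label_dict.get("/St", 1) + (page_index - start_index)
        style = label_dict.get("/S")
        if style == "/D":
            numeric = str(number)
        elif style in ("/r", "/R"):
            numeric = to_roman(number)
        elif style in ("/a", "/A"):
            numeric = to_letters(number)
        else:
            numeric = ""  # no (or unrecognised) /S entry: the label is the prefix alone
        if style in ("/R", "/A"):
            numeric = numeric.upper()
        labels.append(label_dict.get("/P", "") + numeric)
    return labels


nums = [0, {"/S": "/r"}, 4, {"/S": "/D"}, 7, {"/S": "/D", "/P": "A-", "/St": 8}]
print(labels_from_nums(nums, 9))
# ['i', 'ii', 'iii', 'iv', '1', '2', '3', 'A-8', 'A-9']
```

Each label dictionary applies from its start index up to the next start index, which is why the lookup takes the last range that begins at or before the page.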

By the way, when you see an IndirectObject instance, you can call its getObject() method to retrieve the actual object being pointed to.
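If you end up with a mix of direct values and indirect references (as in a /Nums list), a small duck-typed helper can flatten that. This is only a sketch: resolve is a name I made up, and FakeIndirect is a hypothetical stand-in so the snippet runs without any PDF library. Older PyPDF/PyPDF2 releases spell the method getObject(), newer PyPDF2/pypdf releases spell it get_object(), so the helper tries both.

```python
def resolve(obj):
    """Return the object an indirect reference points to, or obj itself."""
    for name in ("get_object", "getObject"):  # newer and older method spellings
        getter = getattr(obj, name, None)
        if callable(getter):
            return getter()
    return obj  # plain values (ints, dicts, ...) pass through unchanged


class FakeIndirect:
    """Hypothetical stand-in for an IndirectObject, for demonstration only."""

    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target


nums = [0, FakeIndirect({"/S": "/r", "/St": 21})]
resolved = [resolve(item) for item in nums]
print(resolved)
# [0, {'/S': '/r', '/St': 21}]
```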

Your alternative is, as you say, to check the text objects and try to figure out which is the page number. You could use extractText() of the page object for this, but you'll get one string back and have to try to fish out the page number from that. (And of course the page number might be Roman or alphabetic instead of numeric, and some pages may not be numbered.) Instead, have a look at how extractText() actually does its job—PyPDF is written in Python, after all—and use it as a basis of a routine that checks each text object on the page individually to see if it's like a page number. Be wary of TOC/index pages that have lots of page numbers on them!
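For the "is this text object like a page number" check, a rough heuristic might look like the sketch below. It assumes you have already split the page into individual text fragments (how you get them depends on the library). The function names and the four-digit limit are assumptions of mine, and it will give false positives: "I", "mix" and "div" all parse as roman numerals.

```python
import re

# Up to four digits; longer runs are more likely years, ISBN parts, etc. (assumption)
_DECIMAL = re.compile(r"^\d{1,4}$")
# Standard roman numeral shape, upper or lower case
_ROMAN = re.compile(
    r"^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def looks_like_page_number(text):
    """True if the fragment is nothing but a decimal or roman numeral."""
    candidate = text.strip()
    if not candidate:
        return False  # the roman pattern would otherwise match the empty string
    return bool(_DECIMAL.match(candidate) or _ROMAN.match(candidate))


def page_number_candidates(fragments):
    """Keep only the fragments that could plausibly be a page number."""
    return [f.strip() for f in fragments if looks_like_page_number(f)]


print(page_number_candidates(["Introduction", "xiv", "see p. 3", " 7 "]))
# ['xiv', '7']
```

A page with many candidates (a table of contents, an index) is a sign that the heuristic should not be trusted for that page.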

I have tried reading up, but of no use.......... Can u give a working code sample ? — dreamer, Sep 30 '15 at 07:00

score 5 · Answer 4 · edited May 23 '17 at 10:30

The answer by kindall is very good. However, since a working code sample was requested later (by dreamer) and since I had the same problem today, I would like to add some notes.

pdf structure is not uniform; there are rather few things you can rely on, hence any working code sample is very unlikely to work for everyone. A very good explanation can be found in this answer.
As explained by kindall, you will most likely need to explore what pdf you are dealing with.

Like so:

import sys
import PyPDF2 as pyPdf

"""Open your pdf"""
pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))

"""Explore the /PageLabels (if it exists)"""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]
    print(page_label_type)
except:
    print("No /PageLabel object")

"""Select the item that is most likely to contain the information you desire; e.g.
       {'/Nums': [0, IndirectObject(42, 0)]}
   here, we only have "/Nums". """

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
    print(page_label_type)
except:
    print("No /PageLabel object")

"""If you see a list, like
       [0, IndirectObject(42, 0)]
   get the correct item from it"""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1]
    print(page_label_type)
except:
    print("No /PageLabel object")

"""If you then have an indirect object, like
       IndirectObject(42, 0)
   use getObject()"""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()
    print(page_label_type)
except:
    print("No /PageLabel object")

"""Now we have e.g.
       {'/S': '/r', '/St': 21}
   meaning roman numerals, starting with page 21, i.e. xxi. We can now also obtain the two variables directly."""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
    print(page_label_type)
    start_page = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
    print(start_page)
except:
    print("No /PageLabel object")

As you can see from the ISO pdf 1.7 specification (relevant section here) there are lots of possibilities of how to label pages. As a simple working example consider this script that will at least deal with decimal (arabic) and with roman numerals:

Script:

import sys
import PyPDF2 as pyPdf

def arabic_to_roman(arabic):
    """Convert a positive integer to a lowercase roman numeral."""
    roman = ''
    while arabic >= 1000:
        roman += 'm'
        arabic -= 1000
    diffs = [900, 500, 400, 300, 200, 100, 90, 50, 40, 30, 20, 10, 9, 5, 4, 3, 2, 1]
    digits = ['cm', 'd', 'cd', 'ccc', 'cc', 'c', 'xc', 'l', 'xl', 'xxx', 'xx', 'x', 'ix', 'v', 'iv', 'iii', 'ii', 'i']
    # Each entry is applied at most once, which is why 300, 200, 30, ... are listed explicitly
    for diff, digit in zip(diffs, digits):
        if arabic >= diff:
            roman += digit
            arabic -= diff
    return roman

def get_page_labels(pdf):
    # Only the first /Nums entry is read, so this assumes a single label range
    # that starts at the first page of the document.
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
    except (KeyError, IndexError):
        page_label_type = "/D"
    try:
        page_start = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
    except (KeyError, IndexError):
        page_start = 1
    page_count = pdf.getNumPages()
    ##or, if you feel fancy, do:
    #page_count = pdf.trailer["/Root"]["/Pages"]["/Count"]
    page_stop = page_start + page_count

    if page_label_type == '/r':
        page_numbers = [arabic_to_roman(n) for n in range(page_start, page_stop)]
    else:
        # "/D" and any style this script does not handle fall back to decimal
        page_numbers = [str(n) for n in range(page_start, page_stop)]

    print(page_label_type)
    print(page_start)
    print(page_count)
    print(page_numbers)
    return page_numbers

pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
get_page_labels(pdf)

dataninsight · Answer 5 · 2022-05-25T10:34:22.510

Getting Page Number from doc using Python

PyMuPDF

import fitz
doc = fitz.open('source_path')
print(doc.page_count)
# prints total page count of input PDF
# (older PyMuPDF releases spell this doc.pageCount; len(doc) also works)

PyPDF2

import PyPDF2
pdfFileObj = open('source.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
# numPages is the total number of pages; page indexing in PyPDF2 starts at 0
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()

pdfinfo

pdfinfo extracts the contents of the Info dictionary in a PDF file. It is another part of the Xpdf project.

pdfinfo filename.pdf

Output:
Title:          HILs.pdf
Subject:
Keywords:
Author:        
Creator:        Acrobat PDFMaker 10.0 for Word
Producer:       Acrobat Distiller 9.3.0 (Windows)
CreationDate:   Mon Jun  2 11:16:53 2014
ModDate:        Mon Jun  2 11:16:53 2014
Tagged:         no
Pages:          3
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      39177 bytes
Optimized:      yes
PDF version:    1.5
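
If you want more than the page count from that output, the "Key: value" lines can be turned into a dictionary with plain string handling. A small sketch, where parse_pdfinfo is a name I made up and the sample text is a shortened copy of the output above:

```python
def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key:   value' lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as timestamps stay intact
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


sample = """Title:          HILs.pdf
CreationDate:   Mon Jun  2 11:16:53 2014
Pages:          3
Page size:      612 x 792 pts (letter)
"""

info = parse_pdfinfo(sample)
print(int(info["Pages"]))
# 3
print(info["Page size"])
# 612 x 792 pts (letter)
```

All values stay strings; convert the ones you need (for example int(info["Pages"])).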

PDFminer

from pdfminer.pdfpage import PDFPage
# fname is the path to your PDF; count the pages by iterating over them
with open(fname, 'rb') as infile:
    print(sum(1 for _ in PDFPage.get_pages(infile)))

score 0 · Answer 6 · answered Sep 08 '20 at 23:15

Another Option is pymupdf: https://pymupdf.readthedocs.io/en/latest/tutorial.html

import fitz

doc = fitz.open('Path To File')
doc.page_count  # spelled doc.pageCount in older PyMuPDF releases; len(doc) also works

pip install pymupdf

For large documents I was getting a recursion error when using pypdf2 so this was another quick and simple way.

Martin Thoma · Answer 7 · 2023-02-08T22:02:53.670

0

Total Page Count with pypdf

from pypdf import PdfReader

reader = PdfReader("example.pdf")
print(len(reader.pages))

Page Label with pypdf

Kindall and orange were on the right path. I added native support to pypdf via #1519 so you don't have to worry. You can now use it:

reader = PdfReader("example.pdf")
for index, page in enumerate(reader.pages):
    label = reader.page_labels[index]
    print(f"Page index {index} has label {label}")
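
For the use case in the question (putting split-up chunks back in order), the labels could be used roughly like this. It is only a sketch under a few assumptions: each chunk kept its original page labels, decimal labels are unique across chunks, and chunks whose first label is not a plain number are front matter that belongs at the start. The function names are made up; first_label relies on the page_labels property shown above and imports pypdf lazily so the sorting part can be tried without it.

```python
def order_by_first_label(first_labels):
    """Sort file names by the numeric value of their first page label.

    first_labels maps a file name to the label of that file's first page.
    Files whose first label is not a plain decimal number sort first,
    in name order (assumed to be front matter).
    """
    def sort_key(item):
        name, label = item
        if label.isascii() and label.isdigit():
            return (1, int(label), name)
        return (0, 0, name)

    return [name for name, _ in sorted(first_labels.items(), key=sort_key)]


def first_label(path):
    """Label of the first page of a PDF, read with pypdf."""
    from pypdf import PdfReader  # lazy import: only needed when reading real files
    return PdfReader(path).page_labels[0]


print(order_by_first_label({"c.pdf": "21", "a.pdf": "1", "b.pdf": "11", "front.pdf": "i"}))
# ['front.pdf', 'a.pdf', 'b.pdf', 'c.pdf']
```

With real files you would build the dictionary as {path: first_label(path) for path in paths} before sorting.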

edited Feb 08 '23 at 22:02

answered Dec 28 '22 at 21:40


Fantastic that there is official support for this. See this answer to a question about retaining page labels when merging pdfs. https://stackoverflow.com/a/61961739/7185107 . It gives you a use case for why page labels matter. I can see that some active development is happening on the github repo! – zoneparser Feb 11 '23 at 04:26
This does not work. PdfReader does not have a "page_labels" method. – max Aug 22 '23 at 01:49
You very likely installed the wrong version or the wrong package. I'm the maintainer of pypdf and I can tell you it does work. – Martin Thoma Aug 22 '23 at 05:58
