How to extract proper page number of a pdf using python

Question

My code reports the page number printed at the bottom of each page, but I need the actual page number within the file, not the document's printed page number. I have attached a screenshot with the page number I need to extract marked in red.

Here is the code I have tried.

import PyPDF2
import re

obj = PyPDF2.PdfFileReader(r"avnet_202209 (1).pdf")

pgno = obj.getNumPages()

S = "Basis of presentation and new accounting pronouncements"

for i in range(0, pgno):
    PgOb = obj.getPage(i)
    Text = PgOb.extractText()
    if re.search(S,Text):
        print("String Found on Page: " + str(i))

The output was:

String Found on Page: 7
String Found on Page: 22

Required output:

String Found on Page: 8
String Found on Page: 23

Does this answer your question? [Retrieve Custom page labels from document with pyPDF](https://stackoverflow.com/questions/12360999/retrieve-custom-page-labels-from-document-with-pypdf) — Ron, Dec 21 '22 at 08:07
@AKX But how does the code identify itself that doc page number (mentioned at below) start from a specific page — Anil Soren, Dec 21 '22 at 08:13
@Ron Not exactly. Those codes shows total number of pages in the pdf — Anil Soren, Dec 21 '22 at 08:14
I would say that the document would always start at page `1`. Do you have any examples that show otherwise ?? — ScottC, Dec 28 '22 at 16:07

score 0 · Answer 1 · answered Dec 31 '22 at 00:19

There are three things you could mean.

I assume you have pypdf installed. I became the maintainer of pypdf and PyPDF2. Only pypdf is still being developed, and it already has all of the features of PyPDF2.

1. Page Index

That is what you start with. If you iterate over the pages, you just need to save the index:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
for index, page in enumerate(reader.pages):
    ...

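The index is zero-based, so the page number a PDF viewer expects (for example, the one I enter when I want to print a page) is index + 1. In your case that gives 8 and 23 instead of 7 and 22.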
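Applied to the search in the question, this could look roughly like the sketch below. The helper names `find_pages` and `find_pages_in_pdf` are made up for illustration and are not part of pypdf. The sketch uses a plain substring test instead of `re.search`, and it assumes the PDF has a text layer that `extract_text()` can read.

```python
def find_pages(page_texts, needle):
    """Return the 1-based page numbers whose text contains `needle`."""
    return [
        index + 1  # zero-based index -> page number shown by a PDF viewer
        for index, text in enumerate(page_texts)
        if needle in text
    ]


def find_pages_in_pdf(path, needle):
    """Hypothetical wrapper: run find_pages over the pages of a PDF file."""
    # Imported here so find_pages itself works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Fall back to "" in case a page yields no extractable text.
    texts = (page.extract_text() or "" for page in reader.pages)
    return find_pages(texts, needle)
```

Calling `find_pages_in_pdf("example.pdf", "Basis of presentation and new accounting pronouncements")` would then return file page numbers rather than indices.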

2. Total page count

from pypdf import PdfReader

reader = PdfReader("example.pdf")
print(f"The PDF document has {len(reader.pages)} pages in total")

3. Page label

For example, this file has the first few pages being labeled as i, ii, ...

Those are called "page labels".

I recently opened a PR that adds support for page labels to pypdf.

I will make a release on Sunday (01.01.2023) at the latest. You can then do this:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
for index, page in enumerate(reader.pages):
    print(f"index={index}: label={reader.page_labels[index]}")
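To see the file page number and the page label side by side for each match, something like the following sketch might help. The helpers `label_matches` and `label_matches_in_pdf` are hypothetical names, not pypdf functions. The sketch assumes a pypdf release that provides `reader.page_labels` and a PDF whose text can be extracted.

```python
def label_matches(page_texts, page_labels, needle):
    """Pair each matching page's 1-based file page number with its label."""
    return [
        (index + 1, label)  # (file page number, page label)
        for index, (text, label) in enumerate(zip(page_texts, page_labels))
        if needle in text
    ]


def label_matches_in_pdf(path, needle):
    """Hypothetical wrapper: run label_matches over a PDF file."""
    # Imported here so label_matches itself works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Fall back to "" in case a page yields no extractable text.
    texts = [page.extract_text() or "" for page in reader.pages]
    return label_matches(texts, reader.page_labels, needle)
```

Each returned tuple holds the position in the file first and the label second, so the two numbering schemes cannot be confused.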
