The code below extracts the page number that is printed at the bottom of every page, but I need the actual page number, that is, the file page number rather than the document page number. I have attached a screenshot in which the page number that needs to be extracted is marked in red. Please look into it.

Here is the code I have tried.

import PyPDF2
import re

obj = PyPDF2.PdfFileReader(r"avnet_202209 (1).pdf")

pgno = obj.getNumPages()

S = "Basis of presentation and new accounting pronouncements"

for i in range(0, pgno):
    PgOb = obj.getPage(i)
    Text = PgOb.extractText()
    if re.search(S, Text):
        print("String Found on Page: " + str(i))

The output was:

String Found on Page: 7
String Found on Page: 22

Required output:

String Found on Page: 8
String Found on Page: 23

Anil Soren
  • Isn't the actual file page number just `i + 1`? – AKX Dec 21 '22 at 08:05
  • Does this answer your question? [Retrieve Custom page labels from document with pyPDF](https://stackoverflow.com/questions/12360999/retrieve-custom-page-labels-from-document-with-pypdf) – Ron Dec 21 '22 at 08:07
  • @AKX But how does the code itself work out that the document page number (printed at the bottom) starts from a specific page? – Anil Soren Dec 21 '22 at 08:13
  • @Ron Not exactly. That code shows the total number of pages in the PDF. – Anil Soren Dec 21 '22 at 08:14
  • I would say that the document would always start at page `1`. Do you have any examples that show otherwise? – ScottC Dec 28 '22 at 16:07

1 Answer

There are three things you could mean.

I assume you can install pypdf. I am the maintainer of pypdf and PyPDF2. Development continues only in pypdf, which already has all of the features of PyPDF2.

1. Page Index

That is what you start with. If you iterate over the pages, you just need to save the index:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
for index, page in enumerate(reader.pages):
    ...

The index is zero-based, so when I want to print a human-readable page number (the file page number a PDF viewer shows), I need to use (index+1).
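Applied to the search from the question, this could look like the following minimal sketch. The helper names `pages_containing` and `search_pdf` are made up for illustration and are not part of pypdf. The sketch uses a plain substring match instead of `re.search`, which is an assumption about what the search is meant to do.

```python
def pages_containing(page_texts, needle):
    """Return the 1-based page numbers of all pages whose text contains needle."""
    return [
        index + 1  # enumerate() is zero-based, file page numbers start at 1
        for index, text in enumerate(page_texts)
        if needle in text
    ]


def search_pdf(path, needle):
    """Hypothetical wrapper: run the search over the extracted text of a PDF file."""
    # Imported here so pages_containing() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    texts = (page.extract_text() or "" for page in reader.pages)
    return pages_containing(texts, needle)
```

Calling `search_pdf("example.pdf", "Basis of presentation and new accounting pronouncements")` would then return file page numbers rather than zero-based indices.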

2. Total page count

from pypdf import PdfReader

reader = PdfReader("example.pdf")
print(f"The PDF document has {len(reader.pages)} pages in total")

3. Page label

For example, this file has its first few pages labeled i, ii, ...

Those are called "page labels".

I've recently opened a PR to add support for them in pypdf.

I will make a release on Sunday (01.01.2023) at the latest. You can then do this:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
for index, page in enumerate(reader.pages):
    print(f"index={index}: label={reader.page_labels[index]}")
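To see all three numbering schemes side by side, the labels can be paired with the zero-based index and the 1-based file page number. This is only a sketch: `describe_pages` is a hypothetical helper, and it assumes it is given a list of label strings such as the one `reader.page_labels` provides.

```python
def describe_pages(labels):
    """Pair each page label with its zero-based index and 1-based file page number."""
    return [
        {"index": index, "file_page": index + 1, "label": label}
        for index, label in enumerate(labels)
    ]


# Stand-in labels for a document whose front matter uses roman numerals.
for row in describe_pages(["i", "ii", "1", "2"]):
    print(f"index={row['index']}: file page={row['file_page']}: label={row['label']}")
```

With a real file, `describe_pages(reader.page_labels)` would take the place of the stand-in list.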
Martin Thoma