I am trying to open a PDF to get its number of pages. I am using PyPDF2.

Here is my code:

import PyPDF2

def pdfPageReader(file_name):
    try:
        reader = PyPDF2.PdfReader(file_name, strict=True)
        number_of_pages = len(reader.pages)
        print(f"{file_name} = {number_of_pages}")
        return number_of_pages
    except:
        # Bare except: this also catches the KeyboardInterrupt raised by Ctrl+C
        return "1"

But then I run into this warning:

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

I tried both strict=True and strict=False. With strict=True, it displays this message and then nothing more happens; I waited for 30 minutes. With strict=False, it displays nothing at all and just hangs. If I press Ctrl+C in the terminal (cmd, Windows 10), it cancels opening that file and continues with the next one (I run this over a batch of PDF files). Only one file in the batch has this problem.

My questions are: how do I fix this, how do I skip it, or how can I cancel it and move on to the other PDF files?
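
One possible way to cancel a stuck file and move on, sketched here under assumptions rather than as a confirmed fix, is to open each PDF in a separate Python process with a time limit. The names `page_count_with_timeout` and `CHILD_CODE` and the 60-second default are hypothetical and chosen only for illustration; the sketch relies on `subprocess.run` with its `timeout` argument and on the same `PdfReader` call as in the question.

```python
import subprocess
import sys

# Child script (assumption: PyPDF2 is installed for the same interpreter).
# It opens one PDF and prints its page count to stdout.
CHILD_CODE = (
    "import sys\n"
    "from PyPDF2 import PdfReader\n"
    "print(len(PdfReader(sys.argv[1], strict=False).pages))\n"
)


def page_count_with_timeout(file_name, timeout=60, child_code=CHILD_CODE):
    """Return the page count, or None if the child fails or times out."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code, file_name],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child once the time limit is exceeded.
        return None
    if result.returncode != 0:
        # The child raised an exception (unreadable file, missing module, ...).
        return None
    try:
        return int(result.stdout.strip())
    except ValueError:
        # The child printed something other than a single integer.
        return None
```

In a batch loop, a return value of None would mark the file as skipped, so the remaining files are still processed. Starting one process per file is slower than reading in-process, which is the trade-off for being able to stop a reader that never returns.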

JBin

5 Answers

If you have a similar problem and it even crashes the program with this error message:

File "C:\Programy\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1604, in getObject
    % (indirectReference.idnum, indirectReference.generation, idnum, generation))
PyPDF2.utils.PdfReadError: Expected object ID (14 0) does not match actual (13 0); xref table not zero-indexed.

It helped me to pass strict=False when creating my PDF reader:

from PyPDF2 import PdfReader
pdf_reader = PdfReader(input_file, strict=False)
DovaX

For anybody else who may be running into this problem, and found that strict=False didn't help, I was able to solve the problem by just re-saving a new copy of the file in Adobe Acrobat Reader. I just opened the PDF file inside an actual copy of Adobe Acrobat Reader (the plain ol' free version on Windows), did a "Save as...", and gave the file a new name. Then I ran my script again using the newly saved copy of my PDF file.

Apparently, the PDF file I was using, which was generated directly from my scanner, was somehow corrupt, even though I could open and view it just fine in Reader. Making a duplicate copy of the file via re-saving in Acrobat Reader somehow seemed to correct whatever was missing.

Bill M.
    Good! It worked for me: I opened the PDF in Adobe Acrobat and used "Save as...". The file went from 900 KB to 500 KB, and now it works. – Camilo Apr 20 '23 at 19:53

I had the same problem and looked for a way to skip it. I am not a programmer, but the documentation on warnings contains a piece of code that helps you avoid this hindrance.

Although I wouldn't recommend this as a solution, this is the piece of code I used for my purpose (copied and pasted from the linked documentation):

import sys

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
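
A narrower variant, given here only as a sketch, is to silence warnings around a single call instead of for the whole program, using the standard library's `warnings.catch_warnings` context manager. The helper name `call_quietly` is hypothetical, and this only has an effect if the message is actually emitted through Python's `warnings` module, which may depend on the PyPDF2 version.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Call func with all warnings ignored, then restore the previous filters."""
    with warnings.catch_warnings():
        # The "ignore" filter applies only inside this with-block.
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)
```

For example, `call_quietly(PdfReader, "some.pdf", strict=False)` would build the reader without printing warnings raised during construction, while warnings elsewhere in the script stay visible. Note that `catch_warnings` modifies global state and is not thread-safe.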
cektek1

This happens to me when the file was created by a printer/scanner combo that generates PDFs. I could still read the PDF with only a warning, though, so I read it in and rewrote it as a new file. I could then append that new file.

from PyPDF2 import PdfMerger, PdfReader, PdfWriter

reader = PdfReader("scanner_generated.pdf", strict=False)
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

with open("fixedPDF.pdf", "wb") as fp:
    writer.write(fp)

merger = PdfMerger()
merger.append("fixedPDF.pdf")
user3660637

I had the exact same problem. The solutions above (setting strict=False and re-saving the document with Acrobat Reader) helped but did not solve it completely: I still got a stream error. I was able to fix that with an online PDF repair tool. I used sejda.com, but please be aware that you are uploading your PDF to a website, so make sure there is nothing sensitive in it.

Erijl