Can't merge PDFs with PyPDF2 - ValueError

Question

I am trying to merge PDFs that I downloaded from Google Drive, and I get this error:

ValueError: invalid literal for int() with base 10: b'F-1.4'

This does not happen when I merge PDFs that I generated with Keynote.

The full error reads like this:

Traceback (most recent call last):
  File "weekly_meeting.py", line 36, in <module>
    file_path = sort_pdf(path)
  File "weekly_meeting.py", line 15, in sort_pdf
    pdf_merger.append(file)
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/merger.py", line 151, in merge
    outline = pdfr.getOutlines()
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1346, in getOutlines
    lines = catalog["/Outlines"]
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1599, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "/usr/local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'

I tried:

- opening the PDF files: they are normal, working PDFs
- exporting them again as PDF with Preview: they still produce the error
- other PDFs: they seem to work fine

This is my code; the problem seems to be the pdf_merger.append(file) call:

import glob
import os

from PyPDF2 import PdfFileMerger

def sort_pdf(path):
    pdf_merger = PdfFileMerger()
    if os.path.isdir(path):
        head, file_name = os.path.split(path)
        os.chdir(path)
        # Merge the files in this fixed order of file name prefixes
        chronology = ["OVERVIEW", "CUSTOMER", "PROJECT", "PERSONAL"]
        for prefix in chronology:
            for file in glob.glob(prefix + "*.pdf"):
                pdf_merger.append(file)
        file_path = path + "/" + file_name + ".pdf"
        with open(file_path, 'wb') as result:
            pdf_merger.write(result)
        return file_path

I expected the output to be a sorted and combined PDF, which I already have achieved with other documents.
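
The ordering step on its own can be sketched without PyPDF2. This is only an illustration, assuming the goal is "all OVERVIEW files first, then CUSTOMER, and so on"; `order_by_prefix` is a hypothetical helper name, not part of the original script. It also sorts the names within each prefix, because `glob.glob` returns matches in arbitrary order.

```python
def order_by_prefix(names, chronology):
    """Return the PDF names grouped by prefix, in the order given by chronology.

    Names inside one prefix group are sorted alphabetically so that the
    result does not depend on directory listing order.
    """
    ordered = []
    for prefix in chronology:
        group = [
            name for name in names
            if name.startswith(prefix) and name.endswith(".pdf")
        ]
        ordered.extend(sorted(group))
    return ordered


if __name__ == "__main__":
    chronology = ["OVERVIEW", "CUSTOMER", "PROJECT", "PERSONAL"]
    names = ["PROJECT_b.pdf", "OVERVIEW.pdf", "PROJECT_a.pdf", "notes.txt"]
    for name in order_by_prefix(names, chronology):
        print(name)
```

In the script above, the same effect could probably be had by wrapping the `glob.glob(...)` call in `sorted(...)`.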

Looks like your input PDF is broken. This `b'F-1.4'` should read `b'%PDF-1.4'` — stovfl, Jan 13 '19 at 19:43
I guess that is something I could solve programmatically, right? Check the header and repair it before I try to sort the PDF? Any idea how I could change the file header? — zagatta-sonah, Jan 14 '19 at 08:23
*"could solve programmatically, right? "*: **No**, verify if you can open the PDF with a PDF reader. Open it with an editor, e.g. leafpad, and verify that the first chars are equal to `'%PDF-1.4'`. — stovfl, Jan 14 '19 at 10:43
Relevant: [PyPDF2/issues/183](https://github.com/mstamy2/PyPDF2/issues/183) — stovfl, Jan 14 '19 at 11:38
I solved it by just writing the header: pdf_reader._header = b_("%PDF-1.4") — zagatta-sonah, Jan 14 '19 at 22:23
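
The check suggested in the comments (do the first characters read `%PDF-1.4`?) can also be done in a few lines of Python instead of a text editor. This is a minimal sketch using only the standard library; `find_pdf_header` and `describe_header` are hypothetical helper names, and the 1024-byte search window is an assumption chosen for illustration.

```python
import sys

PDF_MAGIC = b"%PDF-"


def find_pdf_header(path, window=1024):
    """Return the byte offset of b'%PDF-' in the first `window` bytes, or -1."""
    with open(path, "rb") as fh:
        head = fh.read(window)
    return head.find(PDF_MAGIC)


def describe_header(path):
    """Classify the start of a file: 'ok', 'junk before header' or 'no header'."""
    offset = find_pdf_header(path)
    if offset == 0:
        return "ok"
    if offset > 0:
        return "junk before header"
    return "no header"


if __name__ == "__main__":
    # Usage: python check_header.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        print(path, describe_header(path))
```

This only looks at the start of the file. The traceback shows the value being parsed as an object header, so a file that passes this check could still be damaged elsewhere.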

score 1 · Answer 1 · answered Jan 14 '19 at 22:27

Looks like your input PDF is broken. This b'F-1.4' should read b'%PDF-1.4' – stovfl

Using PdfFileReader and PdfFileWriter instead of PdfFileMerger with the following code solved the problem for me:

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.utils import b_

pdf_writer = PdfFileWriter()
for file in glob.glob(prefix + "*.pdf"):
    pdf_reader = PdfFileReader(file)
    # Overwrite the header stored on the reader
    pdf_reader._header = b_("%PDF-1.4")
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

Basically, this just overwrites the header.

You make `PdfFileReader` happy, but the `PDF` is still broken. — stovfl, Jan 15 '19 at 07:10
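
One possible reason the reader/writer loop avoids the error is that it never reads the outline (bookmark) tree, which is where the traceback in the question fails (`merge` calls `pdfr.getOutlines()`). The traceback also shows `merge` receiving an `import_bookmarks` argument, and in PyPDF2 1.x `PdfFileMerger.append` accepts `import_bookmarks=False`. The sketch below passes that flag for every file; `append_all` is a hypothetical helper, and whether this is enough for the files from Google Drive is untested.

```python
def append_all(merger, paths, import_bookmarks=False):
    """Append every path to merger, by default without importing bookmarks.

    merger is assumed to be a PyPDF2 1.x PdfFileMerger (or any object whose
    append method takes an import_bookmarks keyword argument).
    """
    for path in paths:
        merger.append(path, import_bookmarks=import_bookmarks)
    return merger


# Assumed usage with PyPDF2 1.x installed:
#   from PyPDF2 import PdfFileMerger
#   merger = append_all(PdfFileMerger(), ["OVERVIEW.pdf", "CUSTOMER.pdf"])
#   with open("merged.pdf", "wb") as result:
#       merger.write(result)
```

As with overwriting the header, this would only sidestep the failing read; the input PDF itself stays as it is.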

score 1 · Answer 2 · answered Apr 08 '20 at 08:53

This worked for me. It is based on this; I just completed the code with the import statement and fixed indentation issues.

import PyPDF2

pdfs = ['1.pdf', '2.pdf', '3.pdf']

pdfWriter = PyPDF2.PdfFileWriter()

# loop through all PDFs
for filename in pdfs:
    # rb for read binary
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Opening each page of the PDF
    for pageNum in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

# save the merged PDF to file, wb for write binary;
# the with block closes the output file once writing is done
with open('output.pdf', 'wb') as pdfOutput:
    pdfWriter.write(pdfOutput)
