6

I'm using Python 3.8.5. I'm trying to write a short script that concatenates PDF files and learning from this Stack Overflow question, I'm trying to use PyPDF2. Unfortunately, I can't seem to even create a PyPDF2.PdfFileReader instance without crashing.

My code looks like this:

import pathlib
import PyPDF2

pdf_path = pathlib.Path('1.pdf')
with pdf_path.open('rb') as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file, strict=False)

When I try to run it, I get the following traceback:

Traceback (most recent call last):
  File "C:\...\pdf\open_pdf.py", line 6, in <module>
    reader = PyPDF2.PdfFileReader(pdf_file, strict=False)
  File "C:\...\.virtualenvs\pdf-j0HnXL2B\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\...\.virtualenvs\pdf-j0HnXL2B\lib\site-packages\PyPDF2\pdf.py", line 1883, in read
    stream.seek(-11, 1)
OSError: [Errno 22] Invalid argument

To help reproduce the problem, I created this GitHub repo with the above code and a sample PDF file.

What am I doing wrong?

Amir Rachum
  • 76,817
  • 74
  • 166
  • 248
  • It could be that your PDF document uses a more recent version than the one PyPDF2 supports. I tried your code with a PDF 1.3 document and it works fine. Your PDF document is a 1.7 version. – Mario Camilleri Sep 26 '20 at 14:29

3 Answers3

2

It seems like your 1.pdf file fails validation, checked here: https://www.pdf-online.com/osa/validate.aspx

I tried with another pdf file of version 1.7 and it worked, so it's not about pdf version, you just have a bad 1.pdf file

GProst
  • 9,229
  • 3
  • 25
  • 47
  • Thanks for the pointer. However, this PDF was created by scanning with my scanner (using the standard Windows 10 Scan application). I can open it in any other program (Chrome, Foxit, etc.) What can I do? The whole point of the script is to automate scanning operations. – Amir Rachum Sep 26 '20 at 14:40
  • Perhaps this will help: https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file FYI, I repaired your file in online tool [here](https://www.ilovepdf.com/repair-pdf) and your script worked without an error, so it should be possible to repair your files programmatically. – GProst Sep 26 '20 at 14:44
2

You can accomplish this with PyMuPDF (install - at least on Windows - with pip install pymupdf). The basic pattern for concatenating files is:

import fitz

doc1 = fitz.Document('filename1.pdf')
doc2 = fitz.Document('filename2.pdf')

combined = fitz.Document()  # empty document
combined.insertPDF(doc1)
combined.insertPDF(doc2)
combined.save('combinedfile.pdf')

I tested with your file, and it does issue a warning about the invalid cross reference structure in the PDF, but will work. (The file it creates is valid PDF-1.4)

tegancp
  • 1,204
  • 6
  • 13
-1

the code is good but you need to reduce the pdf size its too large to handle one dummy way to do it is to open the pdf file and press print and in the printers selection use Microsoft print pdf and use this file it should not affect the quality of the file

noob
  • 672
  • 10
  • 28