
I am using the following code to resize pages in a PDF:

from pypdf import PdfReader, PdfWriter, Transformation, PageObject, PaperSize
from pypdf.generic import RectangleObject

# A4 dimensions in PDF points
A4_w = PaperSize.A4.width
A4_h = PaperSize.A4.height

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    # scale the page to fit *inside* A4
    h = float(page.mediabox.height)
    w = float(page.mediabox.width)
    scale_factor = min(A4_h / h, A4_w / w)

    # scale, then shift up so the content is vertically centered on A4
    transform = Transformation().scale(scale_factor, scale_factor).translate(0, A4_h / 2 - h * scale_factor / 2)
    page.add_transformation(transform)

    page.cropbox = RectangleObject((0, 0, A4_w, A4_h))

    # merge the transformed page onto a blank A4 page
    page_A4 = PageObject.create_blank_page(width=A4_w, height=A4_h)
    page.mediabox = page_A4.mediabox
    page_A4.merge_page(page)

    writer.add_page(page_A4)
writer.write("output.pdf")

Source: https://stackoverflow.com/a/75274841/11501160
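
For reference, the scale-and-center arithmetic used in the loop can be checked on its own, without any PDF library. The following is only a sketch: the helper name `fit_and_center` is hypothetical, the A4 size is hard-coded as 595 x 842 points, and the page's lower-left corner is assumed to be at (0, 0). Unlike the code above, which only shifts vertically, it also computes a horizontal offset, which matters when the height is the limiting dimension.

```python
# A4 in PDF points (1/72 inch); hard-coded here so the sketch needs no PDF library.
A4_WIDTH, A4_HEIGHT = 595, 842


def fit_and_center(page_w, page_h, target_w=A4_WIDTH, target_h=A4_HEIGHT):
    """Return (scale, tx, ty) that fits a page inside the target and centers it.

    Hypothetical helper. Assumes the page's lower-left corner is at (0, 0).
    """
    # Use the smaller ratio so both dimensions fit inside the target.
    scale = min(target_w / page_w, target_h / page_h)
    # Split the leftover space equally on both sides.
    tx = (target_w - page_w * scale) / 2
    ty = (target_h - page_h * scale) / 2
    return scale, tx, ty


# US Letter (612 x 792 points): the width is the limiting dimension.
scale, tx, ty = fit_and_center(612, 792)
print(scale, tx, ty)
```

The three values could then be passed to `Transformation().scale(scale, scale).translate(tx, ty)` in place of the vertical-only translation.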

The resizing itself works, and most input files come out fine, but some input files produce a broken output.

I am providing download links to the input.pdf and output.pdf files for testing and review. The output file is completely different from the input file: the images are missing, the background colour is different, and even the plain text on the first page shows only its first line.

Interestingly, these differences only appear when I open the output PDF in Adobe Acrobat or look at the physically printed pages. The PDF looks perfect when I open it in Preview (on macOS) or in the Chrome browser.

input file and output file

I created the input PDF in Preview (on macOS) by combining pages from different PDFs and dragging image files into the thumbnails, as per these instructions: https://support.apple.com/en-ca/HT202945. I have never had a problem with PDFs made this way, and even Adobe Acrobat reads the input PDF properly. Only the output PDF is problematic in Acrobat and on printers.

Is this a bug in pypdf, or am I doing something wrong? How can I get the output PDF to display correctly in Adobe Acrobat and on printers?

Zain Khaishagi
  • If you are asking for help, you should mention all the circumstances of your problem. This helps other people avoid wasting their time. In this case you omitted that you need to run inside Google Colab. – Jorj McKie Feb 05 '23 at 08:45

2 Answers


This is a genuine bug in pypdf, and the fix is due to be released in the next version. See: https://github.com/py-pdf/pypdf/issues/1607

Zain Khaishagi

The following is what PyMuPDF has to offer here. The output displays correctly in all PDF readers:

import fitz  # import PyMuPDF

src = fitz.open("input.pdf")
doc = fitz.open()
for i in range(len(src)):
    page = doc.new_page()  # this is A4 portrait by default
    page.show_pdf_page(page.rect, src, i)  # scaling will happen automatically
doc.save("fitz-output.pdf", garbage=3, deflate=True)

The show_pdf_page() method used above supports many more options, such as selecting sub-rectangles from the source page, rotating it by arbitrary angles, and of course freely selecting the sub-rectangle of the target page that receives the content.

Jorj McKie
  • I'm running it in Google Colab and it doesn't seem to work for me. I get RuntimeError: Directory 'static/' does not exist – Zain Khaishagi Feb 02 '23 at 03:52
  • Well, PyMuPDF must of course be properly installed: `python -m pip install pymupdf`. In a cloud environment, the specific installation procedures of the cloud service must be followed. Because PyMuPDF is **not pure Python**, there are usually special ways to install packages that contain binaries. PyMuPDF does run in Jupyter notebooks, on which Google Colab is also based. – Jorj McKie Feb 02 '23 at 10:27
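
A quick way to tell whether the package is visible to the running interpreter, before calling any of its functions, is to look the module up with the standard library. This is only a sketch: the helper name `is_installed` is hypothetical, and it checks for the module name `fitz` because that is the name imported in the answer above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate a top-level module."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# "fitz" is the import name used in the answer above.
if not is_installed("fitz"):
    print("PyMuPDF not found; install it with: python -m pip install pymupdf")
```

A positive result only means the module was located. It does not explain an error raised at import or at run time, such as the RuntimeError quoted in the comment above.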