27

Does anyone have experience working programmatically with .pdf files? I have a .pdf file and I need to crop every page down to a certain size.

After a quick Google search I found the pyPdf library for Python, but my experiments with it failed. When I changed the cropBox and trimBox attributes on a page object, the results were not what I expected and appeared to be quite random.

Has anyone had any experience with this? Code examples would be much appreciated, preferably in Python.

Martin Thoma
johannth

7 Answers

44

pyPdf does what I expect in this area. Using the following script:

#!/usr/bin/python
#

from pyPdf import PdfFileWriter, PdfFileReader

with open("in.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()

    numPages = input1.getNumPages()
    print "document has %s pages." % numPages

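    for i in range(numPages):
        page = input1.getPage(i)
        # Page size in points; PDF coordinates have their origin at the bottom-left.
        print page.mediaBox.getUpperRight_x(), page.mediaBox.getUpperRight_y()
        # Trim box: 200x200 points, from (25, 25) to (225, 225).
        page.trimBox.lowerLeft = (25, 25)
        page.trimBox.upperRight = (225, 225)
        # Crop box: 25 points inside the trim box on every side.
        page.cropBox.lowerLeft = (50, 50)
        page.cropBox.upperRight = (200, 200)
        output.addPage(page)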

    with open("out.pdf", "wb") as out_f:
        output.write(out_f)

The resulting document has a trim box that is 200x200 points and starts at 25,25 points inside the media box. The crop box is 25 points inside the trim box.

Here is how my sample document looks in Acrobat Professional after processing with the above code: [image: crop pages screenshot]

This document will appear blank when loaded in Acrobat Reader.
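For readers on a current Python, the same idea can be sketched with the `PdfReader`/`PdfWriter` names that other answers here import from PyPDF2 (the `pypdf` package exposes the same names). This is only a sketch: `box_size` and `crop_pdf` are hypothetical helper names, the file names are placeholders, and the library import is deferred so the small geometry helper works on its own.

```python
def box_size(lower_left, upper_right):
    """Return (width, height) in points of a box given two opposite corners."""
    return (upper_right[0] - lower_left[0], upper_right[1] - lower_left[1])


def crop_pdf(src_path, dst_path, lower_left, upper_right):
    """Set the crop box on every page and write the result.

    The content outside the box is hidden by viewers, not removed.
    """
    # Imported here (an illustrative choice) so box_size() stays usable
    # even when PyPDF2 is not installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        page.cropbox.lower_left = lower_left
        page.cropbox.upper_right = upper_right
        writer.add_page(page)
    with open(dst_path, "wb") as out_f:
        writer.write(out_f)


if __name__ == "__main__":
    # Size of the trim box used in the script above.
    print(box_size((25, 25), (225, 225)))
    # Placeholder file names; uncomment to crop a real file.
    # crop_pdf("in.pdf", "out.pdf", (50, 50), (200, 200))
```

Checking the box size before writing is a cheap way to catch swapped corners, which show up as a negative width or height.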

Daniel Griscom
danio
  • This code has the same effect as the code I was experimenting with; the pages of the resulting document were certainly cropped but all blank. Any ideas why that might be? – johannth Jan 22 '09 at 20:26
  • You've probably checked this, but all I can think is that you are cropping a small area of the PDF that is blank? If you have access to Acrobat Pro you can use the crop pages tool to show all the page boxes. I don't know of any free tools that can do this. Maybe evince or okular for Linux? – danio Jan 23 '09 at 14:21
  • I feel really stupid. I misread the API and assumed that the cropbox was upperLeft, lowerRight. So I was indeed just cropping to a blank part of the page. – johannth Jan 24 '09 at 10:56
  • Easily done if you are used to working with screen co-ordinates with the origin at top-left. Took me a while to get used to having the origin at bottom-left in PDF but now I am so used to it I find it jarring to switch back to top-left for screen layout work! – danio Jan 26 '09 at 11:23
  • Why does the original page text (outside the trimBox) 'follow' the cropped PDF? If I do the above and try to include the crop in another PDF (via LaTeX, for example), and scale the crop down, the original text is still there, selectable, albeit invisible. Modifying `page.mediaBox` doesn't seem to help. Any suggestions on how to actually cut the PDF down to trimBox size? Thanks. – Alex Constantin Jun 03 '13 at 15:20
  • @AlexConstantin well, it's taken 4.5 years for me to notice your comment and it's now 8 years since I have done any PDF programming, but I think you want to scale the contents of your PDF? The boxes are just for choosing parts of the page you are interested in, so they will not be able to help. PDF supports this with coordinate transformation matrices, but that is beyond the scope of pyPdf. It's done with the `cm` operator (see [PDF spec](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf) section 8.4.4), which can be manually edited with some tool like PDF cosedit. – danio Feb 07 '18 at 13:09
  • @danio No problem. What I was after was a destructive crop of the PDF area & text. In the end I managed to do it rather easily with `ghostscript`. – Alex Constantin Feb 07 '18 at 20:44
  • Another culprit that can cause empty pages (ask me how I know...), at least if one is using the newer (API-compatible) PyPDF2, is closing the input file backing `PdfFileReader` before calling `write()` on the `PdfFileWriter`. The pages don't seem to be cached to memory, instead being read from disk as needed (which is smart), so if you close the input file before writing the output file, it can't find the contents. Instead of reporting an error, it silently creates blank pages (which seems less smart). – Aleksi Torhamo May 12 '18 at 18:09
  • @AlexConstantin modifying the boxes is indeed non-destructive, and by intent, so that a page can be cropped and then recropped to be larger if need be without losing any of the original document. It's out of scope for pyPdf to do such things as removing content outside a specified area. Glad Ghostscript helped, but other people should be aware that it may lose other PDF features on the way (e.g. transparency) AFAIK – danio May 17 '18 at 09:59
  • @AlexConstantin can you please elaborate on how you got rid of the unwanted text in the page, and only cropped out the table from the PDF? – Jack Daniels Jan 24 '19 at 08:19
  • @JackDaniels This was quite a while ago: first do the resize crop as explained above. Then something like `gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=destructive_crop.pdf resize_crop.pdf` – Alex Constantin Jan 26 '19 at 09:48
  • How can I change lines such as `page.trimBox.lowerLeft = (25, 25)` to specify the desired part? – YasserKhalil Feb 08 '21 at 21:03
  • @AlexConstantin If I'm understanding right, I'd first set /cropbox through Python and then perform the destructive crop with that command? Sorry to bring this up after a few years, but a dive into Ghostscript cropping left me in worse shape. – A. Perez Cera Mar 16 '21 at 19:33
  • @A.PerezCera Correct. I may have set mediaBox as well through Python, to be sure. – Alex Constantin Mar 17 '21 at 08:00
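The origin mix-up discussed in these comments can be avoided with a small conversion helper. The sketch below assumes the rectangle is already measured in points, that the page's media box starts at (0, 0), and that the page height is known; `top_left_rect_to_pdf` is a hypothetical name, not part of pyPdf or PyPDF2.

```python
def top_left_rect_to_pdf(left, top, right, bottom, page_height):
    """Convert a rectangle measured from a top-left origin (y grows downward)
    into the (lower_left, upper_right) corners PDF boxes expect
    (origin at the bottom-left, y grows upward)."""
    lower_left = (left, page_height - bottom)
    upper_right = (right, page_height - top)
    return lower_left, upper_right


# A region 100 points below the top edge of a 792-point-high page:
lower_left, upper_right = top_left_rect_to_pdf(50, 100, 200, 300, 792)
print(lower_left, upper_right)
```

The two returned tuples can be assigned directly to a box's lower-left and upper-right corners.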
15

Use this to get the dimensions of the PDF page:

from PyPDF2 import PdfReader

reader = PdfReader("/Users/user.name/Downloads/sample.pdf")
page = reader.pages[0]
# Corners of the current crop box, in points (origin at the bottom-left).
print(page.cropbox.lower_left)
print(page.cropbox.lower_right)
print(page.cropbox.upper_left)
print(page.cropbox.upper_right)

Then, on the same page object, set the new corner coordinates to crop:

page.mediabox.lower_right = (lower_right_new_x_coordinate, lower_right_new_y_coordinate)
page.mediabox.lower_left = (lower_left_new_x_coordinate, lower_left_new_y_coordinate)
page.mediabox.upper_right = (upper_right_new_x_coordinate, upper_right_new_y_coordinate)
page.mediabox.upper_left = (upper_left_new_x_coordinate, upper_left_new_y_coordinate)

# For example, my custom coordinates:
# page.mediabox.lower_right = (611, 500)
# page.mediabox.lower_left = (0, 500)
# page.mediabox.upper_right = (611, 700)
# page.mediabox.upper_left = (0, 700)
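Since the current box is read first, the new coordinates can be sanity-checked against it before they are assigned, which guards against cropping to an area outside the page. A minimal sketch, assuming boxes are passed as plain `(left, bottom, right, top)` tuples in points; `clamp_box` is a hypothetical helper, not a PyPDF2 function.

```python
def clamp_box(requested, current):
    """Intersect two (left, bottom, right, top) boxes.

    Raises ValueError when the requested box does not overlap the current one,
    which would otherwise produce an empty-looking page.
    """
    left = max(requested[0], current[0])
    bottom = max(requested[1], current[1])
    right = min(requested[2], current[2])
    top = min(requested[3], current[3])
    if left >= right or bottom >= top:
        raise ValueError("requested box lies outside the current page box")
    return (left, bottom, right, top)


# The custom coordinates above, checked against a 612x792-point page:
print(clamp_box((0, 500, 611, 700), (0, 0, 612, 792)))
```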
Martin Thoma
7

How do I know the coordinates to crop?

Thanks for all answers above.

Step 1. Run the following code to get (x1, y1).

from PyPDF2 import PdfWriter, PdfReader

reader = PdfReader("test.pdf")
page = reader.pages[0]
print(page.cropbox.upper_right)

Step 2. View the pdf file in full screen mode.

Step 3. Capture the screen as an image file screen.jpg.

Step 4. Open screen.jpg in MS Paint or GIMP. These applications show the coordinates of the cursor.

Step 5. Note the following coordinates: (x2, y2), (x3, y3), (x4, y4) and (x5, y5), where (x4, y4) and (x5, y5) determine the rectangle you want to crop.

[image: screenshot marking the points (x2, y2) to (x5, y5)]

Step 6. Get page.cropbox.upper_left and page.cropbox.lower_right by the following formulas. Here is a tool for calculating.

page.cropbox.upper_left = (x1*(x4-x2)/(x3-x2),(1-y4/y3)*y1)
page.cropbox.lower_right = (x1*(x5-x2)/(x3-x2),(1-y5/y3)*y1)

Step 7. Run the following code to crop the pdf file.

from PyPDF2 import PdfWriter, PdfReader

reader = PdfReader('test.pdf')
writer = PdfWriter()

for page in reader.pages:
    # PDF y grows upward, so the upper edge has the larger y value.
    page.cropbox.upper_left = (100, 400)
    page.cropbox.lower_right = (300, 200)
    writer.add_page(page)

with open('result.pdf', 'wb') as fp:
    writer.write(fp)
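The Step 6 formulas can be wrapped in a small function so the arithmetic does not have to be done by hand. The figure that defined the points is not available here, so the parameter meanings below are inferred from the formulas and should be treated as assumptions: x2 and x3 are taken as the pixel columns of the page's left and right edges in the screenshot, and y3 as the pixel row of the page's bottom edge, with the page's top edge at row 0. `screen_to_pdf` is a hypothetical name.

```python
def screen_to_pdf(x, y, x1, y1, x2, x3, y3):
    """Map a screenshot pixel (x, y) to PDF points with the Step 6 formulas.

    x1, y1 -- page size in points (page.cropbox.upper_right from Step 1)
    x2, x3 -- pixel columns of the page's left and right edges (assumed meaning)
    y3     -- pixel row of the page's bottom edge, top edge at row 0 (assumed)
    """
    pdf_x = x1 * (x - x2) / (x3 - x2)
    pdf_y = (1 - y / y3) * y1
    return (pdf_x, pdf_y)


# Made-up measurements: a 612x792-point page shown 800 pixels wide and
# 1000 pixels high, starting at pixel column 100.
upper_left = screen_to_pdf(300, 250, 612, 792, 100, 900, 1000)   # (x4, y4)
lower_right = screen_to_pdf(700, 750, 612, 792, 100, 900, 1000)  # (x5, y5)
print(upper_left, lower_right)
```

The two results are what Step 7 assigns to `page.cropbox.upper_left` and `page.cropbox.lower_right`.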
bfhaha
1

You are probably looking for a free solution, but if you have money to spend, PDFlib is a fabulous library. It has never disappointed me.

Ned Batchelder
0

You can convert the PDF to PostScript (pdftops or pdf2ps) and then use text processing on the PostScript file. After that you can convert the output back to PDF (ps2pdf).

This works nicely if the PDFs you want to process are all generated by the same application and are somewhat similar. If they come from different sources, it is usually too hard to process the PostScript files because the structure varies too much. But even then you might be able to fix page sizes and the like with a few regular expressions.
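As an illustration of the regular-expression approach, the sketch below rewrites the DSC `%%BoundingBox:` comment in PostScript text. It assumes the file actually carries such a comment; whether the tool that converts the file back to PDF honors it depends on that tool and its options. `set_bounding_box` is a hypothetical helper name.

```python
import re

# Matches the first DSC bounding-box comment line in the PostScript text.
_BBOX_LINE = re.compile(r"^%%BoundingBox:.*$", re.MULTILINE)


def set_bounding_box(ps_text, left, bottom, right, top):
    """Return ps_text with its first %%BoundingBox comment replaced."""
    new_line = "%%BoundingBox: {} {} {} {}".format(left, bottom, right, top)
    return _BBOX_LINE.sub(new_line, ps_text, count=1)


sample = "%!PS-Adobe-3.0\n%%BoundingBox: 0 0 612 792\n%%EndComments\n"
print(set_bounding_box(sample, 50, 50, 200, 200))
```

Text without a `%%BoundingBox:` line is returned unchanged, so the result is worth checking before converting back.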

max
0

The Acrobat JavaScript API has a setPageBoxes method, but Adobe doesn't provide any Python code samples, only C++, C# and VB.

-1

Cropping pages of a .pdf file

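from PIL import Image

def ImageCrop(infile="page_1.jpg", outfile="page_1_cropped.jpg"):
    # Crops a JPEG image (assumed to be a rendered page), not the PDF itself.
    # Pixel coordinates, origin at the top-left. The output name is a placeholder.
    img = Image.open(infile)
    left = 90
    top = 580
    right = 1600
    bottom = 2000
    img_res = img.crop((left, top, right, bottom))
    # save() takes the output path directly; no separate open() is needed.
    img_res.save(outfile, 'JPEG')

ImageCrop()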
dataninsight