Let's say you have a PDF page with various complex elements on it. The objective is to crop a region of the page (to extract only one of the elements) and then paste it into another PDF page.

example

Here is a simplified version of my code:

import PyPDF2

def extract_tree(in_file, out_file):
    with open(in_file, 'rb') as infp:
        # Read the document that contains the tree (on its first page)
        reader = PyPDF2.PdfFileReader(infp)
        page = reader.getPage(0)

        # Crop the tree. The coordinates below are only placeholders
        page.cropBox.lowerLeft = [100, 200]
        page.cropBox.upperRight = [250, 300]

        # Create an empty document and add a single page containing only the cropped page
        writer = PyPDF2.PdfFileWriter()
        writer.addPage(page)
        with open(out_file, 'wb') as outfp:
            writer.write(outfp)

def insert_tree_into_page(tree_document, text_document):
    # Keep both input files open until the merged document has been written
    with open(text_document, 'rb') as text_fp, open(tree_document, 'rb') as tree_fp:
        # Load the first page of the document containing 'text text text text...'
        text_page = PyPDF2.PdfFileReader(text_fp).getPage(0)

        # Load the previously cropped tree (cropped using 'extract_tree')
        tree_page = PyPDF2.PdfFileReader(tree_fp).getPage(0)

        # Overlay the text page and the tree crop
        text_page.mergeScaledTranslatedPage(page2=tree_page, scale=1.0, tx=100, ty=200)

        # Save the result into a new empty document
        output = PyPDF2.PdfFileWriter()
        output.addPage(text_page)
        with open('merged_document.pdf', 'wb') as output_stream:
            output.write(output_stream)



# First, crop the tree and save it into cropped_document.pdf
extract_tree('document1.pdf', 'cropped_document.pdf')

# Now merge document2.pdf with cropped_document.pdf
insert_tree_into_page('cropped_document.pdf', 'document2.pdf')

The function "extract_tree" seems to work. It generates a PDF file containing only the cropped region (in the example, the tree). The problem is that when I try to paste the tree into the new page, the star and the house from the original image are pasted as well.

caspillaga
  • How did you determine the bounds for the Tree Cropping? Any example pdfs you can point us towards? My guess is the cropping is still grabbing those images. – Edeki Okoh Feb 26 '19 at 18:06
  • Hi Edeki... The code I posted is a simplified version of my code. I implemented an interface to define the bounds of the crop, but in the code above I hard-coded some arbitrary values as a reference. I'm pretty sure that the bounds are not the problem, because "cropped_document.pdf" seems perfectly cropped, but when I try to merge it with the target page, the crop is ignored and the entire page is pasted (instead of just the crop). My guess is that maybe I'm misunderstanding the purpose of cropBox – caspillaga Feb 26 '19 at 18:14
  • Try saving the cropped image itself and not using [writer.addPage(page)](https://pythonhosted.org/PyPDF2/PdfFileWriter.html). It looks like that method adds a page in the existing pdf, but you are still calling page 1 of the pdf in the insert_tree function. But the cropped image is on the second page because of that method so it will merge the house and star also. – Edeki Okoh Feb 26 '19 at 18:22
  • As an example, I used a research paper that contains a graph (example: https://arxiv.org/pdf/1807.03819.pdf). I cropped the graph and then tried to overlay it in a random location of the first page. – caspillaga Feb 26 '19 at 18:26
  • Made an edit to my previous comment. Is the cropped image on page 1 or page 2 of the cropped_document.pdf? I think the issue might be the use of the writer.addPage(page) and you doing the merge on the first page of the document using getPage(0). – Edeki Okoh Feb 26 '19 at 18:28
  • In my example code, I'm assuming all PDF files have only one page. writer.addPage(page) adds a page to an empty new document, so it will have only one page. How can you save the cropped image itself without writer.addPage(page)? – caspillaga Feb 26 '19 at 18:30
  • Let me try to recreate the issue first using some test docs. I think I see the issue. – Edeki Okoh Feb 26 '19 at 18:34
  • Thanks Edeki. I added extra comments to the code, to make it more understandable – caspillaga Feb 26 '19 at 18:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189087/discussion-between-edeki-okoh-and-caspillaga). – Edeki Okoh Feb 26 '19 at 18:44
  • Just in case it helps someone: what I ended up doing was to crop the PDF regions, convert them to SVG and then back to PDF, and finally merge. This solved my problem. My guess is that in this case PyPDF2 only edits the metadata of the page and not the actual content, so the unwanted regions survive and reappear in the final merge. SVG, instead, only saves the desired region and discards the rest. The only inconvenience is that text is no longer editable, as it is converted into vector drawings, but in my case this was not a problem – caspillaga Feb 23 '21 at 13:47
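
The guess in the last comment is consistent with how page boxes are defined in PDF: /CropBox is an entry in the page dictionary that tells a viewer which region to display, while the content stream that draws the page is stored separately. As a rough illustration only (a plain-Python toy model with a hypothetical `page` dict and `set_crop_box` helper, not PyPDF2's actual objects), setting the crop box leaves the drawing instructions untouched:

```python
# Toy model of a PDF page. This dict is hypothetical and is NOT a PyPDF2 object.
page = {
    "/MediaBox": [0, 0, 612, 792],
    # Stand-in for the content stream that draws everything on the page
    "/Contents": ["draw star", "draw house", "draw tree"],
}

def set_crop_box(page, lower_left, upper_right):
    """Record the visible region; the content stream is not modified."""
    page["/CropBox"] = [lower_left[0], lower_left[1],
                        upper_right[0], upper_right[1]]

set_crop_box(page, (100, 200), (250, 300))
print(page["/CropBox"])   # [100, 200, 250, 300]
print(page["/Contents"])  # ['draw star', 'draw house', 'draw tree']
```

A merge that copies the content stream without also clipping to the crop box would therefore bring the star and the house along with the tree.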

2 Answers

I tried something that actually worked. Convert your first output (the PDF containing only the tree) to docx, then convert it back from docx to PDF before merging it with the other PDF pages. It will work (only the tree will be merged).

May I ask how you implemented the interface that defines the bounds of the crop?

Anand

I had the exact same issue. In the end, the solution for me was to make a small edit to the source code of PyPDF2 (taken from this pull request, which never made it into the master branch). What you need to do is insert these lines into the method _mergePage of the class PageObject in the file pdf.py:

page2Content = ContentStream(page2Content, self.pdf)
# Clip page2's content to its trim box (a list is used because map() is a one-shot iterator on Python 3)
clip_rect = [FloatObject(v) for v in (page2.trimBox.getLowerLeft_x(), page2.trimBox.getLowerLeft_y(), page2.trimBox.getWidth(), page2.trimBox.getHeight())]
page2Content.operations.insert(0, [clip_rect, "re"])
page2Content.operations.insert(1, [[], "W"])
page2Content.operations.insert(2, [[], "n"])

(see the pull request for exactly where to put them). With that done, you can then crop the section of a pdf you want, and merge it with another page with no issues. There's no need to save the cropped section into a separate pdf, unless you want to.

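from PyPDF2 import PdfFileReader, PdfFileWriter

# Keep both input files open until the merged document has been written
with open('document1.pdf', 'rb') as tree_fp, open('document2.pdf', 'rb') as text_fp:
    tree_page = PdfFileReader(tree_fp).getPage(0)
    text_page = PdfFileReader(text_fp).getPage(0)

    # Crop the tree (placeholder coordinates)
    tree_page.cropBox.lowerLeft = [100, 200]
    tree_page.cropBox.upperRight = [250, 300]

    # Overlay the cropped tree on the text page
    text_page.mergeScaledTranslatedPage(page2=tree_page, scale=1.0, tx=100, ty=200)
    output = PdfFileWriter()
    output.addPage(text_page)
    with open('merged_document.pdf', 'wb') as out_fp:
        output.write(out_fp)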
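
One thing to keep in mind about tx and ty: assuming the merge applies the usual PDF transformation matrix [scale 0 0 scale tx ty], they are offsets added to the (scaled) original coordinates, not the destination position of the crop. A small sketch of that arithmetic (`placed_box` is a hypothetical helper for illustration, not part of PyPDF2):

```python
def placed_box(lower_left, upper_right, scale, tx, ty):
    """Map a crop rectangle through a scale-then-translate transform.

    Assumes x' = scale * x + tx and y' = scale * y + ty.
    """
    def transform(point):
        return (scale * point[0] + tx, scale * point[1] + ty)

    return transform(lower_left), transform(upper_right)

# The crop [100, 200] - [250, 300] merged with scale=1.0, tx=100, ty=200
print(placed_box((100, 200), (250, 300), 1.0, 100, 200))
# ((200.0, 400.0), (350.0, 500.0))
```

Under that assumption, to put the lower-left corner of the crop at a target point (X, Y) you would pass tx = X - scale * 100 and ty = Y - scale * 200 for the crop above.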

Maybe there's a better way of doing this that inserts that code without directly editing the source code. I'd be grateful if anyone finds a way to do it as this admittedly is a slightly dodgy hack.
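
For anyone wondering what the three inserted operations do, they are standard PDF path operators: "re" appends a rectangle (x, y, width, height) to the current path, "W" intersects the clipping path with that path, and "n" ends the path without stroking or filling it. Everything drawn afterwards is limited to the rectangle. A minimal plain-Python sketch of the operator text this produces (`clip_prefix` is a hypothetical helper and does not touch PyPDF2):

```python
def clip_prefix(lower_left, upper_right):
    """Build the content-stream operators that clip drawing to a rectangle."""
    x, y = lower_left
    width = upper_right[0] - x
    height = upper_right[1] - y
    # "re": add the rectangle to the path, "W": clip to it, "n": end the path unpainted
    return "{} {} {} {} re\nW\nn\n".format(x, y, width, height)

# Clip prefix for the crop [100, 200] - [250, 300]
print(clip_prefix((100, 200), (250, 300)))
```

For the crop used in this answer, the rectangle is 150 wide and 100 high, starting at (100, 200).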