I am trying to crop a region out of a PDF page programmatically. Specifically, my input is going to be a single page PDF and a bounding box on the page. Output is going to be a PDF that contains the characters, graphics paths and images from the original PDF, and it should look like the original PDF. In other words, I want a function that is similar to cropping a region out of an image, but with PDFs.

Three questions:

  1. Is it at all possible to do? From my knowledge of PDFs, it seems possible. But I'm no expert, so I would like to know first if there are some things I'm missing here.

  2. Is there any open source software for this?

  3. Can PDFBox do this currently? I couldn't find such functionality, but I might have missed it. Does anybody know of any attempt at doing this?

rivu
  • Do you merely mean that after cropping a PDF viewer will only show the region within that box? Or do you actually mean that even the drawing instructions for content outside that box should be removed? The former is what @Tilman describes in his answer, the latter is much more complicated. – mkl Mar 21 '16 at 10:49
  • Yes I think I want the second one. I want a new PDF that will contain ONLY the drawing, text and image instructions that are within the bounding box. – rivu Mar 21 '16 at 16:30
  • The second one would be interesting: every operation must be checked and adjusted depending on whether it is inside or outside of your box. And you need some strategy for things that are partly in the box. (Yeah, "this will never happen", LOL) – Tilman Hausherr Mar 21 '16 at 17:57
  • Let's say I can disregard commands partly outside the box. I think the first step would be to calculate the bounding box for each operation. Does PDFBox do this for paths? – rivu Mar 21 '16 at 18:01
  • No. The complete path is constructed. awt does the clipping. – Tilman Hausherr Mar 21 '16 at 19:11
  • @rivu essentially you are asking for something akin to redaction, and redaction is not trivial. Basically you have to parse the content stream of the page in question, keeping track of the graphics state (to know current transformation matrices and clipping paths), and analyze instruction by instruction. If the instruction intersects the area you want to remain, keep it; otherwise drop it or replace it (if it has a side effect, e.g. in case of text drawing instructions, there often is a side effect touching the text matrix and probably even the text line matrix). – mkl Mar 23 '16 at 08:40
  • PDFBox 2.0.0 offers stream parsing classes (`PDFStreamEngine`, `PDFGraphicsStreamEngine`, etc.) as a framework for such a task. – mkl Mar 23 '16 at 08:46
  • Thanks @mkl. I understood mostly what you are saying, except the `side effect` part. It does seem surprising though that nobody has ever tried this. That's the reason I'm getting confused. – rivu Mar 23 '16 at 21:44
  • And thanks for pointing out the resources. – rivu Mar 23 '16 at 21:45
  • @rivu *except the side effect part* - text drawing operations in pdfs can be used like "go position - draw a string thereafter - draw another string thereafter - draw yet another string thereafter ..." If you determine for such a situation that the first string is outside your area but the next two are inside, you cannot simply drop the instruction for drawing the first string as that would make the following ones move left. ... – mkl Mar 24 '16 at 05:16
  • ... Instead, you either have to change the initial "go position" or replace the first string drawing instruction by "move right" for a distance as long as the first string was. – mkl Mar 24 '16 at 05:16
  • *It does seem surprising though that nobody has ever tried this.* - I do think people have tried but determined things are too complicated. E.g. before 2.0.0 there was no generic parsing framework but only a specialized one (for text extraction), and tweaking it for other tasks was quite a pain. Maybe some have tried successfully but keep the source for themselves. – mkl Mar 24 '16 at 05:20
  • Thanks so much mkl. Just a quick question. For text, is there any code that gives the bounding box, font and other style information for each character? – rivu Mar 24 '16 at 15:45
  • Yes, look at the textinfo object attributes. – mkl Mar 29 '16 at 05:45
  • Hi @rivu, I'm looking for the same thing. Did you end up doing it? can you share some code? – Yekhezkel Yovel Apr 07 '16 at 13:03
  • No, not yet. But if I do, the code will surely be public. I'll share the link. – rivu Apr 07 '16 at 13:29
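
The two building blocks mkl describes in the comments above (deciding whether an instruction intersects the area to keep, and replacing a dropped string with an equivalent "move right") can be sketched roughly as follows. This is only an illustration, not PDFBox code: the class `CropSketch` and both method names are hypothetical, the bounds and glyph widths are assumed to have been collected already by whatever content stream parser is in use, and the adjustment formula assumes horizontal writing with the usual text-space rules (glyph widths in thousandths of a text unit, word spacing applied only to single-byte space characters).

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper class; none of these names come from PDFBox.
class CropSketch {

    /**
     * Returns true if an object, whose bounds are given in its own coordinate
     * system, overlaps the area to keep once the current transformation
     * matrix (CTM) has been applied. Zero-area bounds (e.g. a perfectly
     * horizontal line) count as "no overlap" here, so a real implementation
     * would have to treat them separately.
     */
    static boolean intersectsCropArea(Rectangle2D localBounds,
                                      AffineTransform ctm,
                                      Rectangle2D cropArea) {
        // Axis-aligned bounding box of the transformed shape in page coordinates.
        Rectangle2D pageBounds = ctm.createTransformedShape(localBounds).getBounds2D();
        return pageBounds.intersects(cropArea);
    }

    /**
     * Computes the number to put into a TJ array in place of a dropped string
     * so that the following strings stay where they were.
     *
     * glyphWidths: widths of the dropped glyphs in thousandths of a text unit
     * spaceCount:  how many of those glyphs are single-byte spaces
     * fontSize, charSpacing, wordSpacing: text state values (Tfs, Tc, Tw)
     *
     * A TJ number n moves the position by -n / 1000 * fontSize, so the result
     * is negative for a move to the right. Horizontal scaling affects both
     * sides equally and therefore cancels out.
     */
    static double tjAdjustmentFor(double[] glyphWidths, int spaceCount,
                                  double fontSize, double charSpacing,
                                  double wordSpacing) {
        double widthSum = 0;
        for (double w : glyphWidths) {
            widthSum += w;
        }
        double spacing = glyphWidths.length * charSpacing + spaceCount * wordSpacing;
        return -(widthSum + 1000.0 * spacing / fontSize);
    }

    public static void main(String[] args) {
        Rectangle2D crop = new Rectangle2D.Double(20, 20, 200, 400);
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        // An image is drawn into the unit square scaled and moved by the CTM (a b c d e f).
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 50);
        System.out.println("keep image: " + intersectsCropArea(unitSquare, ctm, crop));
        System.out.println("TJ number: "
                + tjAdjustmentFor(new double[] {500, 250, 500}, 1, 10, 0, 0));
    }
}
```

Objects that only partly overlap the area pass this test unchanged, which matches the "disregard commands partly outside the box" simplification discussed above; clipping them exactly would need more work.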

1 Answer

1- Yes, this is called the crop box.

2- Yes, e.g. PDFBox.

3- Yes, just open a PDF, set a crop box, and save it:

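// PDFBox 2.x API; the "..." placeholders stand for your input and output files
PDDocument doc = PDDocument.load(new File(...));
PDPage page = doc.getPage(0);
// PDRectangle(x, y, width, height): lower-left corner at (20, 20), 200 wide, 400 high
page.setCropBox(new PDRectangle(20, 20, 200, 400));
doc.save(...);
doc.close();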

The numbers in PDRectangle are user space units. By default, 1 unit = 1/72 inch.
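
If the bounding box is known in inches or millimetres, the conversion to user space units is simple arithmetic. A small sketch, assuming the default unit size of 1/72 inch (a page can override this with a `UserUnit` entry, which is ignored here); the class and method names are made up for illustration and are not part of PDFBox:

```java
// Hypothetical helper; assumes the default user space unit of 1/72 inch.
class UserSpaceUnits {
    static final double UNITS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts a length in inches to default user space units. */
    static double inchesToUnits(double inches) {
        return inches * UNITS_PER_INCH;
    }

    /** Converts a length in millimetres to default user space units. */
    static double mmToUnits(double mm) {
        return mm / MM_PER_INCH * UNITS_PER_INCH;
    }

    /** Converts default user space units back to millimetres. */
    static double unitsToMm(double units) {
        return units / UNITS_PER_INCH * MM_PER_INCH;
    }

    public static void main(String[] args) {
        // Width of a US Letter page (8.5 in) and of an A4 page (210 mm) in units.
        System.out.println(inchesToUnits(8.5));
        System.out.println(mmToUnits(210));
    }
}
```

The results would be cast to `float` before being passed to the `PDRectangle` constructor.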

Note that the contents outside the cropbox are not gone, they are just hidden.

Tilman Hausherr
  • From Adobe specs, `The crop box defines the region to which the contents of the page are to be clipped (cropped) when displayed or printed.` I don't think this is what I want. I want to create a new PDF that ONLY contains the graphics paths, characters and images within the bounding box in the original PDF. – rivu Mar 21 '16 at 17:49
  • The resulting PDF seems correct, but with further analysis you can see that the text outside the cropbox is not gone, instead it's just hidden. – Fabio Oct 31 '18 at 09:53
  • Yes that's the point of a cropbox. For more scary insights, read the comments below the question. – Tilman Hausherr Oct 31 '18 at 09:57
  • @TilmanHausherr I read them, I was just pointing out, for the casual user that tries the code directly without reading all the comments above (as myself 1 hour ago), that this could be a good enough solution for many, but has its drawbacks. – Fabio Oct 31 '18 at 10:00
  • @Fabio got it. I've included your comment in my answer. Thank you :-) – Tilman Hausherr Oct 31 '18 at 10:03