
My objective is to actually crop a PDF file with PdfClown. There are a lot of tools and libraries that allow cropping a PDF by changing its CropBox. This hides the contents outside a rectangular area, but the content is still there: it can be accessed through a PDF parser, and the PDF size does not change.

What I need instead is to create a new page containing only the contents inside the rectangular area.

So far I've tried scanning the contents and selectively cloning them, but I haven't succeeded yet. Any suggestions on using PdfClown for that?

I've seen that someone is trying something similar with PDFBox ("Cropping a region from a PDF page with PDFBox"), so far without success.

Lorenza
  • I edited my question to be more clear. It seemed to me I've asked more or less the same thing, but maybe I'm wrong. At a certain point the author says _Yes I think I want the second one. I want a new PDF that will contain ONLY the drawing, text and image instructions that are within the bounding box_ – Lorenza Jun 06 '16 at 10:03
  • Oh, I see. I've deleted my comment. You are indeed asking the almost impossible thing. – Tilman Hausherr Jun 06 '16 at 10:08
  • @Lorenza What you want is an actual redaction feature which is decidedly non-trivial to implement. You might want to look at the iText 5 `PdfCleanUp` functionality in the itext-xtra package for inspiration. This implementation already works well for quite a number of PDFs but there still is some way to go for general usability. (It has been removed from the open functionality in iText 7 and now development is continued as closed source add-on.) – mkl Jun 06 '16 at 11:05
  • Could you please give some motivation when you vote down? I'm a newbie with stackoverflow but I would like to learn something even from vote down. Thanks – Lorenza Jun 07 '16 at 13:58
  • I tried the iText5 PdfCleanUp and it actually removes the text content that is inside a specific rectangular area. Not sure it makes the same with other contents especially because the pdf size doesn't change so much. I will have a look deep inside. – Lorenza Jun 08 '16 at 11:35
  • @Lorenza you can try to look at the content stream with itext RUPS to see what changed. – Tilman Hausherr Jun 08 '16 at 14:06
  • @Lorenza *Not sure it makes the same with other contents especially because the pdf size doesn't change so much* - Redaction / Cleanup is not really a feature for making PDFs smaller, merely to prevent certain contents to be extractable. The necessary changes in the content stream sometimes can make the file even grow. The focus for such tools is security, not file size. For file size optimization one would use different strategies. – mkl Jun 10 '16 at 16:49

1 Answer


A bit late, but maybe it helps someone: I am successfully doing what you are asking for, but with other libraries. Required libraries: iText 4 or 5 (the code below uses the .NET port, iTextSharp) and Ghostscript.

Step 1 (with pseudo code):

Using iText, create a PdfWriter instance for a new, blank Document and open a PdfReader on the original file you want to crop. Import the page, create a PdfTemplate at the size of the source page, draw the imported page onto it, and set the template's BoundingBox property to the desired crop rectangle. Then wrap the template in an iText Image object and place it on the new page at an absolute position.

Imports iTextSharp.text
Imports iTextSharp.text.pdf

Dim reader As New PdfReader(sourcefile)
Dim doc As New Document()
Dim writer As PdfWriter = PdfWriter.GetInstance(doc, New System.IO.FileStream(outputfilename, System.IO.FileMode.Create))
' the document must be opened before pages can be imported or content added
doc.Open()

' get the source page as an imported page
Dim page As PdfImportedPage = writer.GetImportedPage(reader, indexOfPageToGet)

' create a PdfTemplate at the original page size - see the iText in Action book, page 91, for full details
Dim pdftemp As PdfTemplate = writer.DirectContent.CreateTemplate(page.Width, page.Height)
' draw the original page onto the template; the six numbers are the transformation matrix (scaling, mirroring, translation)
pdftemp.AddTemplate(page, 1, 0, 0, 1, 0, 0)
' now the critical part - set the BoundingBox property on the template. This makes all objects outside the rectangle invisible.
' llx, lly, urx, ury are the lower-left and upper-right corners of the new crop rectangle
pdftemp.BoundingBox = New iTextSharp.text.Rectangle(llx, lly, urx, ury)
' template not needed anymore
writer.ReleaseTemplate(pdftemp)
' create an iText Image object as a wrapper around the template - with this img object, absolute positioning on the final page is much easier
Dim img As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(pdftemp)
' set img position
img.SetAbsolutePosition(x, y)
' set optional rotation if needed
img.RotationDegrees = 0
' finally, this adds the actual content to the new document
doc.Add(img)
' cleanup
doc.Close()
reader.Close()
writer.Close()

The output file will look cropped, but the objects outside the bounding box are still present in the PDF stream, and the file size will probably have changed very little.

Step 2:

Using Ghostscript with the pdfwrite output device and the appropriate command-line parameters, re-process the PDF from Step 1. This will give you a much smaller PDF. See the Ghostscript documentation for the arguments: https://www.ghostscript.com/doc/9.52/Use.htm. This step actually gets rid of the objects that are outside the bounding box, which is the requirement in your original post, at least for the files that I deal with.
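The answer does not give the exact Ghostscript arguments, so the following is only a minimal sketch of what such a call might look like. The function name `gs_rewrite`, the `GS` variable and the file names are illustrative assumptions; `-o` and `-sDEVICE=pdfwrite` are standard Ghostscript switches, but any further tuning options are left to the documentation linked above.

```shell
#!/bin/sh
# Name of the Ghostscript executable; assumed to be "gs" (on Windows it is
# typically gswin64c). Override by setting GS before calling the function.
GS="${GS:-gs}"

# Rewrite a PDF through the pdfwrite device.
# $1 = input PDF (the output of Step 1), $2 = output PDF. Both are placeholders.
gs_rewrite() {
  if ! command -v "$GS" >/dev/null 2>&1; then
    echo "Ghostscript ($GS) not found" >&2
    return 127
  fi
  # -o sets the output file and implies -dBATCH -dNOPAUSE
  "$GS" -o "$2" -sDEVICE=pdfwrite "$1"
}

# Example call (hypothetical file names):
# gs_rewrite step1_cropped.pdf step2_rewritten.pdf
```

Whether the rewritten file really drops everything outside the bounding box depends on the input, so it is worth inspecting the result rather than assuming it.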

Optional Step 3: Using mutool clean with the -g option, you can garbage-collect unused objects. Your original PDF probably had a lot of objects in its cross-reference (xref) table, which increases file size; after cropping, some of them may no longer be needed. https://mupdf.com/docs/manual-mutool-clean.html
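As a rough sketch of this optional step, the call could look like the following. The function name `mutool_gc`, the `MUTOOL` variable and the file names are assumptions for illustration; `mutool clean -g input.pdf output.pdf` is the documented form of the command.

```shell
#!/bin/sh
# Name of the mutool executable; assumed to be on PATH as "mutool".
MUTOOL="${MUTOOL:-mutool}"

# Garbage-collect unreferenced objects from a PDF.
# $1 = input PDF (for example the output of Step 2), $2 = output PDF. Both are placeholders.
mutool_gc() {
  if ! command -v "$MUTOOL" >/dev/null 2>&1; then
    echo "mutool ($MUTOOL) not found" >&2
    return 127
  fi
  # -g removes objects that nothing else in the file references
  "$MUTOOL" clean -g "$1" "$2"
}

# Example call (hypothetical file names):
# mutool_gc step2_rewritten.pdf step3_cleaned.pdf
```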

PDF is a tricky format, and normally I would agree with @Tilman Hausherr. My suggestion covers the 'almost impossible' scenario and may not work for all files, but it works for all the cases that I deal with.

Hakan Usakli