
I would like to know a way to remove white margins from a PDF file, just like Adobe Acrobat X Pro does. I understand it will not work with every PDF file.

My guess is that the way to do it is to find the text margins and then crop the page to those margins.

PyPdf is preferred.

iText finds text margins based on this code:

public void addMarginRectangle(String src, String dest)
    throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    // Write to the dest parameter (the original referenced an undefined RESULT constant)
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    TextMarginFinder finder;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        finder = parser.processContent(i, new TextMarginFinder());
        PdfContentByte cb = stamper.getOverContent(i);
        cb.rectangle(finder.getLlx(), finder.getLly(),
            finder.getWidth(), finder.getHeight());
        cb.stroke();
    }
    stamper.close();
}
jacktrades

2 Answers


I'm not too familiar with PyPDF, but I know Ghostscript will be able to do this for you. Here are links to some other answers on similar questions:

  1. Convert PDF 2 sides per page to 1 side per page (SuperUser.com)
  2. Freeware to split a pdf's pages down the middle? (SuperUser.com)
  3. Cropping a PDF using Ghostscript 9.01 (StackOverflow.com)

The third answer is probably what made you say 'I understand it will not work with every PDF file'. It uses the pdfmark command to try and set the /CropBox into the PDF page objects.

The method of the first two answers will most likely succeed where the third one fails. This method uses a PostScript command snippet of <</PageOffset [NNN MMM]>> setpagedevice to shift and place the PDF pages on a (smaller) media size defined by the -gNNNNxMMMM parameter (which defines device width and height in pixels).

If you understand the concept behind the first two answers, you'll easily be able to adapt the method used there to crop margins on all 4 edges of a PDF page:

An example command to crop a letter sized PDF (8.5x11in == 612x792pt) by half an inch (==36pt) on each of the 4 edges (command is for Windows):

gswin32c.exe ^
    -o cropped.pdf ^
    -sDEVICE=pdfwrite ^
    -g5400x7200 ^
    -c "<</PageOffset [-36 -36]>> setpagedevice" ^
    -f input.pdf

The resulting page size will be 7.5x10in (== 540x720pt). To do the same on Linux or Mac, use:

gs \
    -o cropped.pdf \
    -sDEVICE=pdfwrite \
    -g5400x7200 \
    -c "<</PageOffset [-36 -36]>> setpagedevice" \
    -f input.pdf
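The arithmetic behind `-g` and `PageOffset` can be sketched in Python. This is only an illustration, not part of the original answer: `build_crop_command` and its parameters are hypothetical names, and the factor of 10 device pixels per point is simply read off the example above (5400x7200 for a 540x720pt page).

```python
def build_crop_command(width_pt, height_pt, left, bottom, right, top,
                       src="input.pdf", dest="cropped.pdf", gs="gs"):
    """Return a Ghostscript argument list that trims the given margins (in pt).

    Hypothetical helper: the names and defaults are assumptions for illustration.
    """
    new_w = width_pt - left - right
    new_h = height_pt - bottom - top
    if new_w <= 0 or new_h <= 0:
        raise ValueError("margins leave no page area")
    # -g takes device pixels; the example above uses 10 pixels per point
    # (5400x7200 for a 540x720pt page), so the same factor is assumed here.
    geometry = f"-g{round(new_w * 10)}x{round(new_h * 10)}"
    # Shift the content left and down so the trimmed area lands on the new media.
    offset = f"<</PageOffset [{-left} {-bottom}]>> setpagedevice"
    return [gs, "-o", dest, "-sDEVICE=pdfwrite", geometry, "-c", offset, "-f", src]


# Letter page, half an inch (36pt) trimmed from each edge.
cmd = build_crop_command(612, 792, 36, 36, 36, 36)
print(" ".join(cmd))
```

The returned list could be handed to `subprocess.run(cmd, check=True)`; on Windows, `gs="gswin32c.exe"` would be the equivalent of the first command.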

Update: How to determine 'margins' with Ghostscript

A comment asked for 'automatic' determination of the white margins. You can use Ghostscript for this, too. Its bbox device can determine the area covered by the (virtual) ink on each page (and hence, indirectly, the whitespace along each edge of the canvas).

Here is the command:

gs \
  -q -dBATCH -dNOPAUSE \
  -sDEVICE=bbox \
   input.pdf 

Output (example):

 %%BoundingBox: 57 29 562 764
 %%HiResBoundingBox: 57.265030 29.347046 560.245045 763.649977
 %%BoundingBox: 57 28 561 667
 %%HiResBoundingBox: 57.265030 28.347046 560.245045 666.295011

The bbox device renders each PDF page in memory (without writing any output to disk) and then prints the BoundingBox and HiResBoundingBox info to stderr. You can modify the command like this to make the results easier to parse:

gs \
    -q -dBATCH -dNOPAUSE \
    -sDEVICE=bbox \
     input.pdf \
     2>&1 \
  | grep -v HiResBoundingBox

Output (example):

 %%BoundingBox: 57 29 562 764
 %%BoundingBox: 57 28 561 667

This would tell you...

  • ...that the lower left corner of the content rectangle of Page 1 is at coordinates [57 29] and the upper right corner is at [562 764]
  • ...that the lower left corner of the content rectangle of Page 2 is at coordinates [57 28] and the upper right corner is at [561 667]
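If you want to read these values in a script, the `%%BoundingBox` lines can be picked out with a regular expression. A minimal Python sketch, assuming the output format shown above; `parse_bboxes` is a hypothetical helper name, and the sample text is copied from the example output:

```python
import re

# Matches integer "%%BoundingBox:" lines only; "%%HiResBoundingBox:" lines
# do not match because of the different prefix.
BBOX_RE = re.compile(
    r"^[ \t]*%%BoundingBox:[ \t]+(-?\d+)[ \t]+(-?\d+)[ \t]+(-?\d+)[ \t]+(-?\d+)[ \t\r]*$",
    re.MULTILINE,
)


def parse_bboxes(text):
    """Return one (llx, lly, urx, ury) tuple per page, in page order."""
    return [tuple(int(v) for v in m.groups()) for m in BBOX_RE.finditer(text)]


sample = """\
 %%BoundingBox: 57 29 562 764
 %%HiResBoundingBox: 57.265030 29.347046 560.245045 763.649977
 %%BoundingBox: 57 28 561 667
 %%HiResBoundingBox: 57.265030 28.347046 560.245045 666.295011
"""
for page, box in enumerate(parse_bboxes(sample), start=1):
    print(page, box)
```

Because the regex skips the HiRes lines itself, the text can come straight from Ghostscript's stderr without the `grep -v` step.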

This means:

  • Page 1 has 57pt of whitespace on the left edge (72pt == 1in == 25.4mm).
  • Page 1 has 29pt of whitespace on the bottom edge.
  • Page 2 has 57pt of whitespace on the left edge.
  • Page 2 has 28pt of whitespace on the bottom edge.

As you can see from this simple example already, the whitespace is not exactly the same for each page. Depending on your needs (you likely want the same size for each page of a multi-page PDF, no?), you have to work out the minimum margin for each edge across all pages of the document.
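One way to get the minimum margins across pages is to take the union of all per-page content boxes. A small Python sketch of that idea; `common_content_box` is a hypothetical name, and skipping zero-sized boxes rests on my assumption that they come from blank pages:

```python
def common_content_box(boxes):
    """Return the smallest (llx, lly, urx, ury) box that covers every page's content."""
    # Skip degenerate boxes (zero width or height), on the assumption
    # that they belong to blank pages and should not shrink the margins.
    used = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]
    if not used:
        raise ValueError("no page has any content")
    return (
        min(b[0] for b in used),  # smallest left margin
        min(b[1] for b in used),  # smallest bottom margin
        max(b[2] for b in used),  # content edge furthest to the right
        max(b[3] for b in used),  # content edge furthest to the top
    )


pages = [(57, 29, 562, 764), (57, 28, 561, 667)]
print(common_content_box(pages))
```

For the two example pages, the left margin stays at 57pt and the bottom margin becomes 28pt, the smaller of the two. The right and top values are only comparable across pages when all pages share the same size.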

Now what about the right and top edge whitespace? To calculate that, you need to know the original page size for each page. The simplest way to determine this is the pdfinfo utility. Example command for a 5 page PDF:

pdfinfo \
  -f 1 \
  -l 5 \
   input.pdf \
| grep "Page "

Output (example):

Page    1 size: 612 x 792 pts (letter)
Page    2 size: 612 x 792 pts (letter)
Page    3 size: 595 x 842 pts (A4)
Page    4 size: 842 x 1191 pts (A3)
Page    5 size: 612 x 792 pts (letter)

This will help you determine the required canvas size and the required (maximum) white margins of the top and right edges of each of your new PDF pages.
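Combining the page size with the bounding box gives the whitespace on all four edges. A hedged Python sketch: `parse_page_sizes` and `margins` are hypothetical helpers, the regex assumes the `pdfinfo` output format shown above, and the subtraction assumes the page's lower left corner sits at the origin [0 0]:

```python
import re

# Matches lines such as "Page    1 size: 612 x 792 pts (letter)".
PAGE_SIZE_RE = re.compile(
    r"^Page\s+(\d+)\s+size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts", re.MULTILINE
)


def parse_page_sizes(text):
    """Map page number -> (width_pt, height_pt)."""
    return {int(n): (float(w), float(h)) for n, w, h in PAGE_SIZE_RE.findall(text)}


def margins(page_size, bbox):
    """Whitespace in pt on each edge, assuming the page origin is at [0 0]."""
    width, height = page_size
    llx, lly, urx, ury = bbox
    return {"left": llx, "bottom": lly, "right": width - urx, "top": height - ury}


sizes = parse_page_sizes("Page    1 size: 612 x 792 pts (letter)\n")
print(margins(sizes[1], (57, 29, 562, 764)))
```

With the Page 1 values from above, the right edge would have 612 - 562 = 50pt of whitespace and the top edge 792 - 764 = 28pt.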

These calculations can all be scripted too, of course.

But if your PDFs all have a uniform page size, or if they are 1-page documents, it is all much easier to get done...

Kurt Pfeifle
  • How can you know automatically where the white margins are? – jacktrades May 02 '12 at 18:14
  • @jacktrades: Of course you can use iText, if you like. Feel free. However, for iText you need to write a Java program using the iText API to do it. With Ghostscript you can remain in the sphere of script programming, which I prefer in cases like this... – Kurt Pfeifle May 02 '12 at 18:30
  • Still can't understand how to find the pdf margins. iText does a similar thing like posted above. – jacktrades May 02 '12 at 18:41
  • Wow... Let me implement this, then I'd get back to you. – jacktrades May 02 '12 at 19:17
  • Great answer, well done! Hopefully the OP will express some gratitude `:)`. – halfer May 02 '12 at 20:55
  • Yes, I expressed gratitude! I'm impressed by pipitas! – jacktrades May 02 '12 at 21:14
  • Hi Pipitas, Please take a look at this: http://alexsleat.co.uk/2011/01/25/using-pdfcrop-to-remove-white-margins-ubuntu/ – jacktrades May 03 '12 at 13:25
  • Pipitas, I've tried your method. It's very slow though, is there a way to make this faster? Adobe Acrobat makes it lot faster. – jacktrades May 03 '12 at 14:21
  • @jacktrades: It depends on *what* makes your Ghostscript run slow. Remember, you're calling Ghostscript twice (once to determine the BoundingBox dimensions, once more to do the margin cropping) and you're probably also calling `pdfinfo` to determine the original page sizes in your script. If you don't use the `-q` switch, it may tell you that it is trying to find fonts (which may not be embedded in PDF) on the local system, and stuff like that. You linked to `pdfcrop` in your other comment -- how fast is that? – Kurt Pfeifle May 03 '12 at 14:59
  • It takes times in both BoundingBox dimensions and margin cropping. I'm using only those. It takes approx same time for each. `pdfcrop` seems much slower. Take less than 2 minutes for a 600 page doc. – jacktrades May 03 '12 at 15:13
  • @jacktrades: Ghostscript's `bbox` device has to completely interpret and 'render' the PDF, each page, in order to discover the real bounding box. 2 min for 600 pages gives you about 5 pages per second, which I find reasonable. (I also know of pay-ware, commercial commandline tools for Linux/Mac/Windows which will be able to crop pages probably as fast as Acrobat does... Interested?) – Kurt Pfeifle May 03 '12 at 16:51
  • @jacktrades: I've found that supplying the `--resolution 72` option to `pdfcrop` makes it much, much faster (in one of my files it was 13+ minutes vs 18 seconds). This translates to `-r72` option being passed to `ghostscript`. Also, using the `--xetex` option generates much smaller (cropped) output files. – Prakash K Sep 10 '13 at 21:50
  • @PrakashK: changing the resolution of a file to 72 ppi may be a completely unwanted side effect for the sake of speeding up the cropping... (And it will only speed it up in case the PDF contained lots of raster images -- text and vector graphics won't speed up...) – Kurt Pfeifle Sep 10 '13 at 22:34
  • @KurtPfeifle: I agree that specifying explicit resolution may not be acceptable in all cases. However, my (limited) experiments have shown that `-r72` has made a huge difference on the [file](http://lib.store.yahoo.net/lib/paulgraham/onlisp.pdf) which had only text and no graphics. It might be specific to my ghostscript installation, but the difference was 13 minutes (without the `-r72` option) vs 18 seconds. – Prakash K Sep 13 '13 at 20:27
  • @PrakashK: this seems to be an interesting file then. Would you mind sharing (a link to) it? – Kurt Pfeifle Sep 13 '13 at 23:28
  • @PrakashK: Ah, I only notice now: you've been using `pdfcrop`. This is a rather complicated Perl script, AFAIR. I can't recall how it exactly does work internally, but it seems to run Ghostscript to determine the bounding boxes for each page, using by default a high resolution (which gains not much advantage when determining the bounding box). This would explain the slowness you observed for `pdfcrop`... – Kurt Pfeifle Sep 13 '13 at 23:43
  • @KurtPfeifle: Here's the [link to the file](http://lib.store.yahoo.net/lib/paulgraham/onlisp.pdf), which was in my previous comment too. Yes, I was using `pdfcrop`, which @jacktrades, to whom my first comment in this thread was addressed, was using too. The command invoked by `pdfcrop` to compute the `bbox` is: `gs -sDEVICE=bbox -dBATCH -dNOPAUSE -c save pop -f file.pdf`. If the `--resolution XX` option is given to `pdfcrop`, it is translated to the ghostscript `-rXX` option. – Prakash K Sep 14 '13 at 02:05
  • @KurtPfeifle: I ran `pdfcrop` on another file of 900+ pages (with some images). This one the difference is not so dramatic as the previous one. Without the --resolution option it took 3.5 minutes, and with it 2 minutes. – Prakash K Sep 14 '13 at 02:50
  • @PrakashK: I just checked -- `bbox` device for some strange reason uses a default resolution of 4000 dpi. I had always assumed it would use 72 dpi. (I checked by running `gs -o /dev/null -sDEVICE=bbox -c "currentpagedevice {exch ==only ( ) print ==} forall quit" | grep -i resolution`. See also "[Querying Ghostscript for the default options/settings of an output device (such as 'pdfwrite' or 'tiffg4')](http://stackoverflow.com/a/11002313/359307)".) – Kurt Pfeifle Sep 14 '13 at 05:09
  • @PrakashK: `bbox` needs a high resolution for preciseness of the `%%HiResBoundingBox:` values. At 72 dpi it can only report integer values, not decimal fractions. – Kurt Pfeifle Sep 14 '13 at 05:14

Try pdfcrop. It needs ghostscript.

Martin Schröder
  • Regarding the "huge file" problem, in the comments of [this blog post](http://alexsleat.co.uk/2011/01/25/using-pdfcrop-to-remove-white-margins-ubuntu/) they suggest using `pdfcrop --xetex --resolution 72 [other-options] input.pdf output.pdf` to solve it. – Andrea Lazzarotto Jun 27 '14 at 23:07
  • Free, fast, automatically and correctly identifies margins, preinstalled. Just what I needed. – fuenfundachtzig Feb 19 '15 at 12:49
  • It won't work with pdfs that are password protected – Venktaish Nov 08 '20 at 18:35