I am trying to split a single PDF into multiple PDFs, for example a 10-page document into 10 single-page documents.

PDDocument source = PDDocument.load(input_file);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(file);
output.close();

The problem is that the new document's page size differs from the original document's, so some text is cropped or missing in the new document. I am using PDFBox 2.0. How can I avoid this?

UPDATE: Thanks @mkl.

Splitter did the magic. Here is the updated, working part:

public static void extractAndCreateDocument(SplitMeta meta, PDDocument source)
      throws IOException {

    File file = new File(meta.getFilename());

    Splitter splitter = new Splitter();
    splitter.setStartPage(meta.getStart());
    splitter.setEndPage(meta.getEnd());
    splitter.setSplitAtPage(meta.getEnd());

    List<PDDocument> docs = splitter.split(source);
    if(docs.size() > 0){
      PDDocument output = docs.get(0);
      output.save(file);
      output.close();
    }
  }

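public class SplitMeta {

  private String filename;
  private int start;
  private int end;

  public SplitMeta() {
  }

  // Getters called by extractAndCreateDocument; setters are not shown in the post
  public String getFilename() { return filename; }
  public int getStart() { return start; }
  public int getEnd() { return end; }
}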
jaks
    Unfortunately you don't share a sample document to reproduce the issue. I would assume that the PDF page in question has inherited properties which are not copied over by `PDDocument.addPage`. Can you share a sample document to analyze? – mkl May 30 '16 at 14:21

1 Answer

Unfortunately the OP has not provided a sample document to reproduce the issue. Thus, I have to guess.

I assume that the issue is caused by attributes that are not set directly on the page object but inherited from its parents in the page tree.

In that case PDDocument.addPage is the wrong choice, as this method only adds the given page object to the target document's page tree without taking inherited attributes into account.
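
To illustrate the assumed mechanism, here is a minimal sketch in plain Java. It is a simplified model of a page tree, not PDFBox code: PageTreeSketch, Node, shallowMove and moveWithResolvedBox are hypothetical names, and the media box is just a string. It shows how a page whose box is only defined on its parent loses that box when it is re-parented as is, and keeps it when the inherited value is resolved first.

```java
/** Simplified model of a PDF page tree; an illustration only, not PDFBox code. */
class PageTreeSketch {

    /** A node in the tree: an intermediate "Pages" node or a leaf page. */
    static class Node {
        Node parent;      // null for a root node
        String mediaBox;  // null means "not set here, inherit from the parent"

        Node(Node parent, String mediaBox) {
            this.parent = parent;
            this.mediaBox = mediaBox;
        }
    }

    /** Walks up the parent chain until some node defines the media box. */
    static String resolveMediaBox(Node node) {
        for (Node n = node; n != null; n = n.parent) {
            if (n.mediaBox != null) {
                return n.mediaBox;
            }
        }
        return null; // nothing defined anywhere in the chain
    }

    /** Re-parents the page as is; inherited values stay behind in the old tree. */
    static Node shallowMove(Node page, Node newRoot) {
        page.parent = newRoot;
        return page;
    }

    /** Stores the resolved inherited value on the page itself before re-parenting. */
    static Node moveWithResolvedBox(Node page, Node newRoot) {
        page.mediaBox = resolveMediaBox(page);
        page.parent = newRoot;
        return page;
    }

    public static void main(String[] args) {
        Node sourceRoot = new Node(null, "[0 0 842 1191]"); // box set on the parent only
        Node targetRoot = new Node(null, null);             // fresh tree without a box

        Node pageA = new Node(sourceRoot, null);
        Node pageB = new Node(sourceRoot, null);

        System.out.println(resolveMediaBox(shallowMove(pageA, targetRoot)));
        System.out.println(resolveMediaBox(moveWithResolvedBox(pageB, targetRoot)));
    }
}
```

Run as is, the first line printed is null and the second is [0 0 842 1191]. In a real PDF a reader would have to fall back to some default size in the first case, which would match the cropping described in the question.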

Instead one should use PDDocument.importPage which is documented as:

/**
 * This will import and copy the contents from another location. Currently the content stream is stored in a scratch
 * file. The scratch file is associated with the document. If you are adding a page to this document from another
 * document and want to copy the contents to this document's scratch file then use this method otherwise just use
 * the {@link #addPage} method.
 * 
 * Unlike {@link #addPage}, this method does a deep copy. If your page has annotations, and if
 * these link to pages not in the target document, then the target document might become huge.
 * What you need to do is to delete page references of such annotations. See
 * <a href="http://stackoverflow.com/a/35477351/535646">here</a> for how to do this.
 *
 * @param page The page to import.
 * @return The page that was imported.
 * 
 * @throws IOException If there is an error copying the page.
 */
public PDPage importPage(PDPage page) throws IOException

Actually, even this method might not suffice on its own, because it does not consider all inherited attributes. The Splitter utility class, however, gives an impression of what one has to do:

PDPage imported = getDestinationDocument().importPage(page);
imported.setCropBox(page.getCropBox());
imported.setMediaBox(page.getMediaBox());
// only the resources of the page will be copied
imported.setResources(page.getResources());
imported.setRotation(page.getRotation());
// remove page links to avoid copying not needed resources 
processAnnotations(imported);

making use of the helper method

private void processAnnotations(PDPage imported) throws IOException
{
    List<PDAnnotation> annotations = imported.getAnnotations();
    for (PDAnnotation annotation : annotations)
    {
        if (annotation instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink)annotation;   
            PDDestination destination = link.getDestination();
            if (destination == null && link.getAction() != null)
            {
                PDAction action = link.getAction();
                if (action instanceof PDActionGoTo)
                {
                    destination = ((PDActionGoTo)action).getDestination();
                }
            }
            if (destination instanceof PDPageDestination)
            {
                // TODO preserve links to pages within the splitted result  
                ((PDPageDestination) destination).setPage(null);
            }
        }
        // TODO preserve links to pages within the splitted result  
        annotation.setPage(null);
    }
}

As you are trying to split a single PDF into multiple ones, like a 10-page document into 10 single-page documents, you might want to use this Splitter utility class as is; by default it splits after every page.
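
If the split is instead driven by explicit page ranges, as with the start and end values of the question's SplitMeta, the ranges can be computed up front. The following is only a sketch in plain Java: SplitRanges and compute are hypothetical names, and it assumes 1-based, inclusive page numbers, as the question's use of setStartPage and setEndPage suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that computes page ranges for fixed-size chunks. */
class SplitRanges {

    /**
     * Returns 1-based, inclusive {start, end} pairs that cover pageCount pages
     * in chunks of at most pagesPerFile pages each.
     */
    static List<int[]> compute(int pageCount, int pagesPerFile) {
        if (pageCount < 0 || pagesPerFile < 1) {
            throw new IllegalArgumentException("pageCount must be >= 0 and pagesPerFile >= 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= pageCount; start += pagesPerFile) {
            // The last chunk may be shorter than pagesPerFile.
            int end = Math.min(start + pagesPerFile - 1, pageCount);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 pages, one page per file: ten ranges from {1, 1} to {10, 10}
        for (int[] range : compute(10, 1)) {
            System.out.println(Arrays.toString(range));
        }
    }
}
```

Each pair could then populate one SplitMeta instance, together with a file name of your choosing.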

Tests

To test those methods I used the PDF Clown sample output AnnotationSample.Standard.pdf, because that library relies heavily on inheritance of page tree values. I copied the content of its only page to a new document using PDDocument.addPage, PDDocument.importPage, or Splitter, like this:

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(new File(RESULT_FOLDER, "PageAddedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java test testWithAddPage)

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.importPage(page);
output.save(new File(RESULT_FOLDER, "PageImportedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java test testWithImportPage)

PDDocument source = PDDocument.load(resource);
Splitter splitter = new Splitter();
List<PDDocument> results = splitter.split(source);
Assert.assertEquals("Expected exactly one result document from splitting a single page document.", 1, results.size());
PDDocument output = results.get(0);
output.save(new File(RESULT_FOLDER, "PageSplitFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java test testWithSplitter)

Only the final test copied the page faithfully.

mkl