Splitting a large Pdf file with PDFBox gets large result files

Question

I am processing some large PDF files (up to 100 MB and about 2000 pages) with PDFBox. Some of the pages contain a QR code, and I want to split those files into smaller ones, each holding the pages from one QR code to the next. I got this working, but every result file is as large as the source file: if I cut a 100 MB PDF into ten files, I get ten files of 100 MB each.

This is the code:

PDDocument documentoPdf =
        PDDocument.loadNonSeq(new File("myFile.pdf"),
                new RandomAccessFile(new File("./tmp/temp"), "rw"));

int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();

int previusQR = 0;
for (int i = 0; i < numPages; i++) {
    PDPage page = (PDPage) pages.get(i);
    BufferedImage firstPageImage =
            page.convertToImage(BufferedImage.TYPE_USHORT_565_RGB, 200);

    String qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);

    if (qrText != null && i != 0) {
        // write the pages from the previous QR code up to this one
        PDDocument outputDocument = new PDDocument();
        for (int j = previusQR; j < i; j++) {
            outputDocument.importPage((PDPage) pages.get(j));
        }
        File f = new File("./splitting_files/" + previusQR + ".pdf");
        outputDocument.save(f);
        outputDocument.close();
        previusQR = i;
    }
}
// close the source only after all chunks have been written
documentoPdf.close();

I also tried the following code for storing the new file:

PDDocument outputDocument = new PDDocument();

for(int j = previusQR; j<i; j++){
 PDStream src = ((PDPage)pages.get(j)).getContents();
 PDStream streamD = new PDStream(outputDocument);
 streamD.addCompression();

 PDPage newPage = new PDPage(new   
           COSDictionary(((PDPage)pages.get(j)).getCOSDictionary()));
 newPage.setContents(streamD);

 byte[] buf = new byte[10240];
 int amountRead = 0;
 InputStream is = null;
 OutputStream os = null;
 is = src.createInputStream();
 os = streamD.createOutputStream();
 while((amountRead = is.read(buf,0,10240)) > -1) {
    os.write(buf, 0, amountRead);
  }

 outputDocument.addPage(newPage);
}

File f = new File("./splitting_files/"+previusQR+".pdf");

outputDocument.save(f);
outputDocument.close();

But this code creates files that lack some content and are still the same size as the original.

How can I create smaller PDF files from a larger one? Is it possible with PDFBox? Is there any other library that can render a single page to an image (for QR recognition) and also split a big PDF file into smaller ones?

Thx!

What version are you using? Can you share the PDF? The effect you describe may happen if each page references all resources of all pages, instead of just the ones it is really using. — Tilman Hausherr, Feb 17 '16 at 14:26
I am using version 1.8.9 (I am compiling with Java 1.6). You can download the file [here](https://drive.google.com/open?id=0B0cAeEoswLtlMGZ2MWtJUVFaYUE "pdf"). I generated it using [PDF_Chain](http://pdfchain.sourceforge.net/ "pdf_chain"). — Nuria, Feb 18 '16 at 07:11
Current version is 1.8.11 or 2.0 RC3. I tried the PDFSplit command utility with the first chunk, the result file (p 1- 59) is 1.7 MB. I'll try your code tonight to see if there's a difference. — Tilman Hausherr, Feb 18 '16 at 07:33

score 3 · Accepted Answer · answered Feb 18 '16 at 09:21

Thx! Tilman, you are right: the PDFSplit command generates smaller files. I checked the PDFSplit code and found that it clears the page references held by link annotations, so that resources that are not needed are not pulled into the result.

Code extracted from Splitter.class:

private void processAnnotations(PDPage imported) throws IOException
    {
        List<PDAnnotation> annotations = imported.getAnnotations();
        for (PDAnnotation annotation : annotations)
        {
            if (annotation instanceof PDAnnotationLink)
            {
                PDAnnotationLink link = (PDAnnotationLink)annotation;   
                PDDestination destination = link.getDestination();
                if (destination == null && link.getAction() != null)
                {
                    PDAction action = link.getAction();
                    if (action instanceof PDActionGoTo)
                    {
                        destination = ((PDActionGoTo)action).getDestination();
                    }
                }
                if (destination instanceof PDPageDestination)
                {
                    // TODO preserve links to pages within the splitted result  
                    ((PDPageDestination) destination).setPage(null);
                }
            }
            else
            {
                // TODO preserve links to pages within the splitted result  
                annotation.setPage(null);
            }
        }
    }

So eventually my code looks like this:

PDDocument documentoPdf = PDDocument.loadNonSeq(
        new File("docs_compuestos/50.pdf"),
        new RandomAccessFile(new File("./tmp/t"), "rw"));

int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();

int previusQR = 0;
for (int i = 0; i < numPages; i++) {
    PDPage firstPage = (PDPage) pages.get(i);
    // null means "no QR code on this page"; starting from "" would pass
    // the null check below even when decoding threw NotFoundException
    String qrText = null;

    BufferedImage firstPageImage =
            firstPage.convertToImage(BufferedImage.TYPE_USHORT_565_RGB, 200);
    firstPage = null;

    try {
        qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
    } catch (NotFoundException e) {
        e.printStackTrace();
    } finally {
        firstPageImage = null;
    }

    if (i != 0 && qrText != null) {
        PDDocument outputDocument = new PDDocument();
        outputDocument.setDocumentInformation(documentoPdf.getDocumentInformation());
        outputDocument.getDocumentCatalog().setViewerPreferences(
                documentoPdf.getDocumentCatalog().getViewerPreferences());

        for (int j = previusQR; j < i; j++) {
            PDPage sourcePage = (PDPage) pages.get(j);
            PDPage importedPage = outputDocument.importPage(sourcePage);

            importedPage.setCropBox(sourcePage.findCropBox());
            importedPage.setMediaBox(sourcePage.findMediaBox());
            // only the resources of the page will be copied
            importedPage.setResources(sourcePage.getResources());
            importedPage.setRotation(sourcePage.findRotation());

            processAnnotations(importedPage);
        }

        File f = new File("./splitting_files/" + previusQR + ".pdf");
        previusQR = i;

        outputDocument.save(f);
        outputDocument.close();
    }
}
documentoPdf.close();
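
One thing to watch in the loop above: a chunk is only written when the next QR code is found, so the pages after the last QR code never reach a file. A minimal sketch of computing the page ranges up front, independent of PDFBox, could look like the following. The class name `SplitRanges`, the method `chunkRanges`, and the idea of first collecting the QR page indices into a list are assumptions for illustration, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: computes page ranges for splitting, no PDF library involved. */
final class SplitRanges {

    /**
     * @param qrPages   ascending 0-based indices of the pages that carry a QR code
     *                  (assumed to be collected beforehand, all smaller than pageCount)
     * @param pageCount total number of pages in the source document
     * @return half-open ranges {from, to}; each range is one output file
     */
    static List<int[]> chunkRanges(List<Integer> qrPages, int pageCount) {
        List<int[]> ranges = new ArrayList<int[]>();
        int start = 0;
        for (int qrPage : qrPages) {
            // a QR code on the current start page does not open a new chunk
            if (qrPage > start) {
                ranges.add(new int[] {start, qrPage});
                start = qrPage;
            }
        }
        // trailing chunk: pages from the last QR code to the end of the document
        if (start < pageCount) {
            ranges.add(new int[] {start, pageCount});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // QR codes on pages 0, 3 and 7 of a 10-page document
        for (int[] r : chunkRanges(Arrays.asList(0, 3, 7), 10)) {
            System.out.println(r[0] + ".." + (r[1] - 1));
        }
        // prints 0..2, 3..6 and 7..9 on separate lines
    }
}
```

Each returned range can then drive the inner import loop (`for j = from; j < to`), with the file named after `from` as in the code above.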

Thank you very much!!

Wow you're good. (I just found out the same but you beat me). I'll think of a good additional comment for the importPage javadoc. — Tilman Hausherr, Feb 18 '16 at 16:39
Note the two TODOs. You could call processAnnotations() for each page after having created your new document with all pages, and then check whether the page to be "nulled" is in your destination document or not. — Tilman Hausherr, Feb 18 '16 at 16:42
Please press the green checkmark (if available). You won't get any points for this but the question will appear as answered. Please also change the title of your question to something like "... still gets large result files" or whatever other text with the same meaning so that more "victims" of this problem will find it. I've already changed the javadoc of 2.0 and will do this for 1.8 later as well. — Tilman Hausherr, Feb 19 '16 at 09:10
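
The check suggested in the comments for the two TODOs (keep a link if its target page ends up in the same output file, clear it otherwise) comes down to a range test on page indices. The sketch below models only that decision with plain integers; `LinkRemap` and `remapTarget` are hypothetical names, and wiring the result back into the annotations (pointing the destination at the imported page, or calling `setPage(null)` as Splitter does) is left out.

```java
/** Hypothetical helper: decides whether a link target survives a split. */
final class LinkRemap {

    /**
     * @param targetPage 0-based index of the link's target page in the source document
     * @param from       first source page of the chunk (inclusive)
     * @param to         end of the chunk (exclusive)
     * @return the target's 0-based index inside the chunk, or -1 if the target
     *         lies outside the chunk and the link should be cleared
     */
    static int remapTarget(int targetPage, int from, int to) {
        if (targetPage >= from && targetPage < to) {
            return targetPage - from;
        }
        return -1;
    }

    public static void main(String[] args) {
        // chunk holds source pages 3..6
        System.out.println(remapTarget(5, 3, 7)); // target is inside the chunk
        System.out.println(remapTarget(9, 3, 7)); // target is outside the chunk
    }
}
```

A non-negative result identifies which imported page the link should point to; a negative one corresponds to the case the Splitter code handles by setting the page to null.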
