Is it possible to efficiently split a PDF into individual pages (using FPDI)?

Question

I am trying to split large files into individual pages, using PHP's FPDI library.

For some reason, splitting the file does little to reduce the file size. For example, applying the following script to a 30-page, 1MB file produces 30 files of around 0.9MB each, i.e. a total of around 26MB!

This suggests that a large part of the original file is retained in every output file, even though it is not needed.

Questions:

Is this avoidable?
Is this a bug in FPDI?
Is there an alternative PHP library that is more efficient at splitting?

More detail

I've reproduced this issue in a variety of configurations:

FPDI version 1 (no longer supported) and FPDI version 2
Using FPDF and TCPDF
PHP 5.4 and PHP 5.6
Various PDF files, including files generated using FPDF and TCPDF

Here is some PHP code to illustrate the issue:

<?php

testPdfSplit();

function testPdfSplit()
{
    echo phpversion();

    //Load a file
    $contentPath = "/path/to/local/files/original_file.pdf";
    copy("https://file-examples.com/wp-content/uploads/2017/10/file-example_PDF_1MB.pdf", $contentPath);
    $numpages = 30;

    //Get the original file size
    $fileSize = round(filesize($contentPath) / (1024 * 1024), 3);
    echo "<p>Original file is $fileSize MB</p>";

    for($i=1; $i<=$numpages; $i++)
    {
        echo "<p>Creating file for page $i</p>";
        $filePath = "/path/to/local/files/test.$i.pdf";

        try
        {
            // Pass the source path defined above ($content was never defined)
            selectOnePage($contentPath, $i, $filePath);
        }
        catch (Exception $e)
        {
            die ("<pre>ERROR: $e</pre>");
        }

        $fileSize = round(filesize($filePath) / (1024 * 1024),3);
        echo "<p>$filePath is $fileSize MB</p>";
    }
}

function selectOnePage($filePathIn, $pageNo, $filePathOut)
{
    require_once('fpdf/fpdf.php');
    require_once('fpdi/src/autoload.php');

    // initiate FPDI
    $pdf = new \setasign\Fpdi\Fpdi();

    // get the page count
    $pageCount = $pdf->setSourceFile($filePathIn);

    echo "<p>Selecting page $pageNo / $pageCount</p>";

    // import a page
    $pdf->AddPage();
    $templateId = $pdf->importPage($pageNo);
    $pdf->useImportedPage($templateId);

    //output the file
    $pdf->Output($filePathOut, 'F');
}

score 3 · Answer 1 · answered Aug 21 '19 at 13:26

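FPDI does not analyze which resources an imported page actually uses; it copies every resource that the page's resource dictionary references.

If, for example, a document has only a single resource dictionary shared by all pages (a common structure), all of the document's resources are copied into every split file.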
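
One rough way to check whether a particular file has this structure is to look at which resource dictionary each page object points to. The following is only a heuristic sketch, not anything FPDI provides: the function names `findPageResourceRefs` and `pagesShareOneResourceDictionary` are made up for illustration, and the scan assumes that page objects are stored uncompressed (not inside object streams) and that each page carries its own indirect `/Resources` reference rather than inheriting it from the page tree.

```php
<?php

/**
 * Heuristic sketch (hypothetical helper, not part of FPDI): map each page
 * object number to the indirect /Resources object that the page references.
 */
function findPageResourceRefs(string $pdfBytes): array
{
    $refs = [];

    // Indirect objects end with "endobj", so each chunk holds at most one object.
    foreach (explode('endobj', $pdfBytes) as $chunk) {
        // Keep page objects only: "/Type /Page", but not "/Type /Pages".
        if (!preg_match('~/Type\s*/Page(?![A-Za-z])~', $chunk)) {
            continue;
        }
        // Read the object header "N G obj" and the reference "/Resources N G R".
        if (preg_match('~(\d+)\s+\d+\s+obj\b~', $chunk, $obj)
            && preg_match('~/Resources\s*(\d+)\s+(\d+)\s+R\b~', $chunk, $res)
        ) {
            $refs[(int) $obj[1]] = $res[1] . ' ' . $res[2] . ' R';
        }
    }

    return $refs;
}

/** True when several pages were found and they all reference the same dictionary. */
function pagesShareOneResourceDictionary(array $refs): bool
{
    return count($refs) > 1 && count(array_unique($refs)) === 1;
}

// Hand-written stand-in for a real file: two pages, both pointing at object 5.
$sample = "%PDF-1.4\n"
    . "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    . "2 0 obj\n<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>\nendobj\n"
    . "3 0 obj\n<< /Type /Page /Parent 2 0 R /Resources 5 0 R >>\nendobj\n"
    . "4 0 obj\n<< /Type /Page /Parent 2 0 R /Resources 5 0 R >>\nendobj\n"
    . "5 0 obj\n<< /Font << >> >>\nendobj\n";

$refs = findPageResourceRefs($sample);
echo pagesShareOneResourceDictionary($refs) ? "shared\n" : "not shared\n"; // prints: shared
```

For a real document, the input would be the result of `file_get_contents()` on the source file. An empty result means the heuristic does not apply to that file (for example, because its objects are compressed or its resources are inherited), not that the pages have separate dictionaries.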

We also offer a commercial (non-free) tool for merging and splitting PDF documents: the SetaPDF-Merger component. By default this tool has the same problem, but we've prepared a demo with code that removes unused resources after the split process. You can find the demo and code here.

Jan Slabon

Thanks @Jan - do you know of any way of generating a PDF *without* a single resource dictionary? (I'm using TCPDF to generate my files) – IanS Aug 21 '19 at 14:47
If TCPDF uses a single resource dictionary, that is not easy to change. For a generator, using a single dictionary is very easy, because any resource can be reused on any page without registering it in individual dictionaries. – Jan Slabon Aug 21 '19 at 15:04

score 0 · Answer 2 · answered Aug 21 '19 at 14:44

This appears to be a general problem with most PDF tools: it also affects pdftk and cpdf, as described in "pdftk split pdf with multiple pages".

Most PDFs I have come across have a single resource dictionary, so the split files cannot easily be made smaller (thanks to @Jan Slabon for the explanation).
