How to extract pages from a PDF using iText 7?

Question

I am trying to use the iText 7 library to extract some pages from a PDF file and create a new file from them.

    static void Splitter()
    {
        string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 $29,514.42 02.12.20.pdf";
        string range = "1, 4, 8";
        var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
        var split = new PdfSplitter(pdfDocumentInvoiceNumber);
        var result = split.ExtractPageRange(new PageRange(range));
        var numberOfPagesPdfDocumentInvoiceNumber = result.GetNumberOfPages();
        String toFile = @"C:\Users\Standard\Downloads\Result\Extracted.pdf";
        var pdfWriter = new PdfWriter(toFile);
        var pdfDocumentInvoiceMergeResult = new PdfDocument(pdfWriter);
        for (var i = 1; i <= numberOfPagesPdfDocumentInvoiceNumber; i++)
        {
            var pdfPage = result.GetPage(i).CopyTo(pdfDocumentInvoiceMergeResult);
            pdfDocumentInvoiceMergeResult.AddPage(pdfPage);
        }
    }

But when I call the CopyTo method, I get the following error:

iText.Kernel.PdfException: 'Cannot copy indirect object from the document that is being written.'

The problem is clear from the error message and explained [here](https://stackoverflow.com/a/58434289/1729265) and [here](https://stackoverflow.com/a/53815830/1729265): *This restriction that pages cannot be copied from documents written to is due to the iText architecture: When a document is written to, iText attempts to push this new content out into the `PdfWriter` output stream as soon as possible and then forget about it. This allows iText to easily produce large result PDFs without requiring a large amount of memory. The downside is the restriction you're confronted with.* — mkl, Jun 04 '20 at 07:40
Another reason is that some structures in a document written to are not finalized before the document is closed, e.g. subset embedded fonts. — mkl, Jun 04 '20 at 07:44
So, how can I even use the method ExtractPageRange() properly with that kind of limitation? — H.Sou, Jun 04 '20 at 12:38

score 3 · Accepted Answer · answered Jun 05 '20 at 15:57

The problem here is that the documents returned by the PdfSplitter methods, in particular by ExtractPageRange, are iText 7 documents written to, i.e. these PdfDocument instances have been instantiated with a PdfWriter.

Such documents are subject to certain restrictions, in particular that pages cannot be copied from them. For details on this read the answers here and here.

For these result documents (and the whole PdfSplitter class with them) to be of any use, you therefore need a way to define where their PdfWriter objects write to. There is a way, albeit not a very intuitive one: you have to override the GetNextPdfWriter method of PdfSplitter, which originally looks like this:

    /// <summary>This method is called when another split document is to be created.</summary>
    /// <remarks>
    /// This method is called when another split document is to be created.
    /// You can override this method and return your own
    /// <see cref="iText.Kernel.Pdf.PdfWriter"/>
    /// depending on your needs.
    /// </remarks>
    /// <param name="documentPageRange">the page range of the original document to be included in the document being created now.
    ///     </param>
    /// <returns>the PdfWriter instance for the document which is being created.</returns>
    protected internal virtual PdfWriter GetNextPdfWriter(PageRange documentPageRange) {
        return new PdfWriter(new ByteArrayOutputStream());
    }

In a use case like yours in which you merely expect a single return document you eventually want to write to a file, you can do so like this:

    class MySplitter : PdfSplitter
    {
        public MySplitter(PdfDocument pdfDocument) : base(pdfDocument)
        {
        }

        protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
        {
            String toFile = @"C:\Users\Standard\Downloads\Result\Extracted.pdf";
            return new PdfWriter(toFile);
        }
    }

With the PdfWriter instantiation moved into that custom splitter your main code is reduced to

    string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 $29,514.42 02.12.20.pdf";
    string range = "1, 4, 8";
    var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
    var split = new MySplitter(pdfDocumentInvoiceNumber);
    var result = split.ExtractPageRange(new PageRange(range));
    result.Close();

In a use case like yours this admittedly looks weird, having to derive a custom class from the PdfSplitter merely to extract a few pages from a source PDF to a result PDF. Wouldn't an additional PdfWriter parameter to the ExtractPageRange have made it much easier?

Please be aware, though, that the main objective of the PdfSplitter class is to split documents into many parts using the ExtractPageRanges and SplitBy... methods, and in that situation you'd need to supply a larger, probably not exactly known number of PdfWriters... not easier at all!

Of course, a better solution probably would have been injecting some lambda expression or some other callback mechanism. For example:

    class ImprovedSplitter : PdfSplitter
    {
        private Func<PageRange, PdfWriter> nextWriter;
        public ImprovedSplitter(PdfDocument pdfDocument, Func<PageRange, PdfWriter> nextWriter) : base(pdfDocument)
        {
            this.nextWriter = nextWriter;
        }

        protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
        {
            return nextWriter.Invoke(documentPageRange);
        }
    }

You can use it like this:

    string file = @"C:\Users\Standard\Downloads\Merged\CK 2002989 $29,514.42 02.12.20.pdf";
    string range = "1, 4, 8";
    var pdfDocumentInvoiceNumber = new PdfDocument(new PdfReader(file));
    var split = new ImprovedSplitter(pdfDocumentInvoiceNumber, pageRange => new PdfWriter(@"C:\Users\Standard\Downloads\Result\Extracted.pdf"));
    var result = split.ExtractPageRange(new PageRange(range));
    result.Close();

Thank you! I ended up overwriting the methods. But I still wonder what kind of use the default return of this method would have. — H.Sou, Jun 06 '20 at 07:46
What if I want to write the result PDF to a memory stream instead of a file stream? I have the same requirement, but I do not have permission to create files on my production server, so I have to keep the results in memory. Do you have any suggestions for that? — Karan Shah, Dec 02 '20 at 04:03
@KaranShah in your `GetNextPdfWriter` override you can create such memory streams and store them in a collection property of your splitter class. After splitting you can retrieve that collection from your splitter. Obvious collection choices would be either a list or a map based on the `PageRange` as corresponding key or value. — mkl, Dec 02 '20 at 07:21
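The suggestion in the comment above could look roughly like the following. This is a minimal sketch, not code from the thread: the names `InMemorySplitter`, `Streams`, and `ExtractToBytes` are hypothetical, and it assumes a single page range, so the first stream in the list is the one of interest. It relies on `MemoryStream.ToArray()` still being usable after the `PdfWriter` has closed the stream.

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Utils;

// Hypothetical splitter that keeps every split document in memory.
class InMemorySplitter : PdfSplitter
{
    // One stream per split document, in the order the splitter requests writers.
    public List<MemoryStream> Streams { get; } = new List<MemoryStream>();

    public InMemorySplitter(PdfDocument pdfDocument) : base(pdfDocument)
    {
    }

    protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange)
    {
        var stream = new MemoryStream();
        Streams.Add(stream);
        return new PdfWriter(stream);
    }

    // Illustrative helper: extracts one page range and returns the result as bytes.
    public static byte[] ExtractToBytes(string sourcePath, string range)
    {
        using (var source = new PdfDocument(new PdfReader(sourcePath)))
        {
            var splitter = new InMemorySplitter(source);
            // Closing the result document finalizes the PDF in the memory stream.
            splitter.ExtractPageRange(new PageRange(range)).Close();
            // ToArray() also works on a MemoryStream that has been closed.
            return splitter.Streams[0].ToArray();
        }
    }
}
```

When splitting into several parts, a dictionary keyed by the `PageRange` passed to `GetNextPdfWriter` may be more convenient than a list, as the comment suggests.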

schlebe · Answer 2 · 2023-01-16T20:24:43.827

The problem is linked to PdfSplitter, but the extraction can be done without it!

The following code replaces yours and runs without error messages.

    Private Sub TestCopyTo()
        ' sPdfInputFile is assumed to hold the path of the source PDF.
        Dim iPageRange As Integer() = {2, 4, 8}

        ' The source is opened with a PdfReader only, so its pages can be copied.
        Using pdfInput = New PdfDocument(New PdfReader(sPdfInputFile))
            Using pdfNew = New PdfDocument(New PdfWriter("result.pdf"))
                For Each iPage In iPageRange
                    Dim oNewPage As PdfPage = pdfInput.GetPage(iPage).CopyTo(pdfNew)
                    pdfNew.AddPage(oNewPage)
                Next
            End Using ' Disposing pdfNew closes it and writes result.pdf.
        End Using
    End Sub

This is certainly simpler, and it does the job!

For your information, I installed iText 7 version 7.2.5 using the NuGet package manager in Visual Studio 2022 on Windows 11.
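Since the question uses C#, the same direct-copy approach might be sketched there as follows. The class and method names (`PageExtractor`, `ExtractPages`) are hypothetical. The sketch uses `PdfDocument.CopyPagesTo`, which copies the listed pages and appends them to the target document, instead of the `CopyTo` plus `AddPage` pair.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;

static class PageExtractor
{
    // Copies the given 1-based page numbers from sourcePath into a new PDF at targetPath.
    public static void ExtractPages(string sourcePath, string targetPath, IList<int> pageNumbers)
    {
        // The source is opened with a PdfReader only, so copying from it is allowed.
        using (var source = new PdfDocument(new PdfReader(sourcePath)))
        using (var target = new PdfDocument(new PdfWriter(targetPath)))
        {
            source.CopyPagesTo(pageNumbers, target);
        } // target is disposed (and written) first, then source.
    }
}
```

A call such as `PageExtractor.ExtractPages(file, toFile, new List<int> { 1, 4, 8 })` would then correspond to the page range in the question.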
