Split a Pdf into byte array pages with IText7

Question

I need to split a PDF file into per-page byte arrays without using the file system. I found the following code from @AlexeySubach, which seems to work, but I can't figure out how to get the contents out of the DocumentReadyListender callback:

using System;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Utils;

class ByteArrayPdfSplitter : PdfSplitter {

    private MemoryStream currentOutputStream;

    public ByteArrayPdfSplitter(PdfDocument pdfDocument) : base(pdfDocument) {
    }

    protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange) {
        currentOutputStream = new MemoryStream();
        return new PdfWriter(currentOutputStream);
    }

    public MemoryStream CurrentMemoryStream {
        get { return currentOutputStream; }
    }

    public class DocumentReadyListender : IDocumentReadyListener {

        private ByteArrayPdfSplitter splitter;

        public DocumentReadyListender(ByteArrayPdfSplitter splitter) {
            this.splitter = splitter;
        }

        public void DocumentReady(PdfDocument pdfDocument, PageRange pageRange) {
            pdfDocument.Close();
            byte[] contents = splitter.CurrentMemoryStream.ToArray();
            String pageNumber = pageRange.ToString();
        }
    }
}

Usage:

    public static List<Byte[]> SplitOnPages(Byte[] bytes)
    {
        using (MemoryStream memoryStream = new MemoryStream(bytes))
        {
            using (PdfReader reader = new PdfReader(memoryStream))
            {
                PdfDocument docToSplit = new PdfDocument(reader);
                ByteArrayPdfSplitter splitter = new ByteArrayPdfSplitter(docToSplit);
                splitter.SplitByPageCount(1, new ByteArrayPdfSplitter.DocumentReadyListender(splitter));
            }
        }

        //How do I get here the array of byte array pages??
        return ...
    }

I need to avoid the file system because the code is part of a website hosted in Azure. The problem is more about C# than about PDF specifically. I tested the code and it works; I just don't see how to get the result for each page from the contents variable in DocumentReadyListener.DocumentReady back to the calling method — Matias Masso, Jun 29 '23 at 06:56
So what you really want is people to explain this code to you. This isn't about C#. This code is incomplete and won't even compile. There's no `SplitByPageCount` and the code that's included doesn't try to access any pages. — Panagiotis Kanavos, Jun 29 '23 at 07:17
`I found the next code from @AlexeySubach which seems to work` where's that code? Do you have a link to it? What do you even mean by `split a PDF into byte array pages`? A PDF is a file containing print commands. The commands that specify a page and its contents can't stand by themselves. Are you trying to extract those pages into separate PDF files? — Panagiotis Kanavos, Jun 29 '23 at 07:21
Check the latest answer to [How to Extract pages from a PDF using IText 7?](https://stackoverflow.com/questions/62187647/how-to-extract-pages-from-a-pdf-using-itext-7). iText 7 makes this very easy. You can get each page with `var page=myPDF.GetPage(iPage);` and then copy and add it to a new PDF file with `var newPage=page.CopyTo(newPDF); newPDF.AddPage(newPage);` This is also shown in the iText 7 docs, in [Chapter 6: Reusing existing PDF documents | .NET](https://kb.itextpdf.com/home/it7kb/ebooks/itext-jump-start-tutorial-for-net/chapter-6-reusing-existing-pdf-documents-net) — Panagiotis Kanavos, Jun 29 '23 at 07:30
If you check the documentation of the PdfPage object you'll see there's a [PdfPage.GetContentBytes()](https://api.itextpdf.com/iText/dotnet/8.0.0/classi_text_1_1_kernel_1_1_pdf_1_1_pdf_page.html#ac0e423781b4c80af985ddc503da93bef) method too. Other methods on the same object suggest there may be multiple content streams in the same page, but GetContentBytes seems to return all the content bytes. I'm not sure if that contains any metadata, comments etc. It certainly doesn't contain the fonts used by the entire document — Panagiotis Kanavos, Jun 29 '23 at 07:37
@Panagiotis, you can find Alexey Subach's code in the first answer at https://stackoverflow.com/questions/46375760/itext-7-0-4-0-converting-pdfdocument-to-byte-array . The code compiles perfectly well, and if you set a breakpoint at the contents variable you can see it being populated for each page. My problem was how to get this value out of the function. mkl has posted an elegant solution, which is exactly the answer I was looking for. — Matias Masso, Jun 29 '23 at 16:56
Is that better than the 2 lines of code possible in 2023 with iText 7? — Panagiotis Kanavos, Jun 30 '23 at 07:07
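
The page-copy approach suggested in the comments above could look roughly like the sketch below. It assumes iText 7 for .NET and uses `PdfDocument.CopyPagesTo(int, int, PdfDocument)` to copy one page at a time into a fresh in-memory document. The class name `PageCopySplitter` is hypothetical and does not appear in the original thread.

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;

// Hypothetical helper, not part of the original thread.
public static class PageCopySplitter
{
    // Returns one single-page PDF (as a byte array) per page of the input PDF.
    public static List<byte[]> SplitOnPages(byte[] bytes)
    {
        var result = new List<byte[]>();

        // Open the source document in reading mode from memory.
        using (var source = new PdfDocument(new PdfReader(new MemoryStream(bytes))))
        {
            // iText page numbers are 1-based.
            for (int pageNumber = 1; pageNumber <= source.GetNumberOfPages(); pageNumber++)
            {
                var output = new MemoryStream();

                // Closing the target document finishes writing the PDF into the stream.
                using (var target = new PdfDocument(new PdfWriter(output)))
                {
                    source.CopyPagesTo(pageNumber, pageNumber, target);
                }

                // MemoryStream.ToArray() still works after the stream has been closed.
                result.Add(output.ToArray());
            }
        }

        return result;
    }
}
```

This variant needs no `PdfSplitter` subclass or listener, at the cost of managing the output streams by hand.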

score 1 · Accepted Answer · answered Jun 29 '23 at 11:03

The code from Alexey Subach that you found expects you to put some meaningful operation into the DocumentReady method of DocumentReadyListender. Since you ultimately want a list of the resulting PDFs as byte arrays, add the bytes of each finished document to such a list, for example by extending DocumentReadyListender like this:

public class DocumentReadyListender : IDocumentReadyListener
{
    public List<byte[]> splitPdfs;

    private ByteArrayPdfSplitter splitter;

    public DocumentReadyListender(ByteArrayPdfSplitter splitter, List<byte[]> results)
    {
        this.splitter = splitter;
        this.splitPdfs = results;
    }

    public void DocumentReady(PdfDocument pdfDocument, PageRange pageRange)
    {
        pdfDocument.Close();
        byte[] contents = splitter.CurrentMemoryStream.ToArray();
        splitPdfs.Add(contents);
    }
}

(ByteArrayPdfSplitter, improved helper class DocumentReadyListender)

With that change you can make your SplitOnPages operational:

public static List<Byte[]> SplitOnPages(Byte[] bytes)
{
    List<byte[]> result = new List<byte[]>();
    using (MemoryStream memoryStream = new MemoryStream(bytes))
    {
        using (PdfReader reader = new PdfReader(memoryStream))
        {
            PdfDocument docToSplit = new PdfDocument(reader);
            ByteArrayPdfSplitter splitter = new ByteArrayPdfSplitter(docToSplit);
            splitter.SplitByPageCount(1, new ByteArrayPdfSplitter.DocumentReadyListender(splitter, result));
        }
    }

    return result;
}

(SplitInMemory test, improved method SplitOnPages)
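
To check the result, each returned byte array can be reopened from memory and its page count inspected. The following is a minimal sketch assuming iText 7 for .NET; the helper class `SplitResultCheck` and its method `CountPages` are hypothetical names introduced only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;

// Hypothetical helper, not part of the original answer.
public static class SplitResultCheck
{
    // Opens every PDF in the list from memory and returns its page count.
    public static List<int> CountPages(List<byte[]> splitPdfs)
    {
        var counts = new List<int>();

        foreach (byte[] pdfBytes in splitPdfs)
        {
            // Closing the document also closes the underlying reader.
            using (var document = new PdfDocument(new PdfReader(new MemoryStream(pdfBytes))))
            {
                counts.Add(document.GetNumberOfPages());
            }
        }

        return counts;
    }
}
```

Called as `SplitResultCheck.CountPages(SplitOnPages(bytes))`, every entry should be 1 when the split was done with `SplitByPageCount(1, ...)`.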

Thanks @mkl, this is exactly the answer I was looking for! — Matias Masso, Jun 29 '23 at 16:57
