How can I combine multiple PDF files excluding page breaks using iTextSharp?

Question

Has anyone done this with iTextSharp? I would like to combine multiple PDF files into one but leave the page breaks out. For example, I would like to create 4 PDF files containing 3 lines of text each, so I want the resulting file to have all 12 lines in 1 page. Is this possible?

Please be aware that each and every page is drawn on its own canvas. Thus, there is no *page break* to *leave out*; instead, there are multiple canvases from which to cut out parts and project them onto a common canvas. So it is not as trivial as *leaving out a page break* sounds. But as you are the creator of the PDFs, you have control over them, and it is feasible. Are you sure you only need some lines of text on each of the pages? — mkl, Jan 16 '15 at 16:47
Each PDF has just a few lines, perhaps a table or an image, but I want the end result in one page. I've tried some code I've found out there, but it still places a page break for each PDF combined. — alozada, Jan 16 '15 at 17:25
Also, some of the PDFs may contain fields, so I'd like to keep those fields in the resulting combined PDF as well. — alozada, Jan 16 '15 at 17:29
The first challenge is to find the area with content. This is not a trivial lookup; instead, all drawing instructions have to be inspected. If there is header or footer material, please indicate how it can be recognized. — mkl, Jan 16 '15 at 19:16
Fields (I assume you mean AcroForm fields) complicate things further, as they are stored as a separate structure. — mkl, Jan 16 '15 at 19:23

score 3 · Accepted Answer · edited May 23 '17 at 12:07

As the OP also tagged this question with [iText] and I am more at home with Java than .Net, here is an answer for iText/Java. It should be easy to translate to iTextSharp/C#.

The original question

I would like to combine multiple PDF files into one but leave the page breaks out. For example, I would like to create 4 PDF files containing 3 lines of text each, so I want the resulting file to have all 12 lines in 1 page.

For PDF files as indicated in that example you can use this simple utility class:

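import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.TextMarginFinder;

// Stacks the text areas of the imported pages top to bottom on pages of the
// given size, starting a new page only when the next block does not fit (iText 5.x).
public class PdfDenseMergeTool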
{
    public PdfDenseMergeTool(Rectangle size, float top, float bottom, float gap)
    {
        this.pageSize = size;
        this.topMargin = top;
        this.bottomMargin = bottom;
        this.gap = gap;
    }

    public void merge(OutputStream outputStream, Iterable<PdfReader> inputs) throws DocumentException, IOException
    {
        try
        {
            openDocument(outputStream);
            for (PdfReader reader: inputs)
            {
                merge(reader);
            }
        }
        finally
        {
            closeDocument();
        }

    }

    void openDocument(OutputStream outputStream) throws DocumentException
    {
        final Document document = new Document(pageSize, 36, 36, topMargin, bottomMargin);
        final PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        this.document = document;
        this.writer = writer;
        newPage();
    }

    void closeDocument()
    {
        try
        {
            // openDocument may have failed before the field was assigned
            if (document != null)
            {
                document.close();
            }
        }
        finally
        {
            this.document = null;
            this.writer = null;
            this.yPosition = 0;
        }
    }

    void newPage()
    {
        document.newPage();
        yPosition = pageSize.getTop(topMargin);
    }

    void merge(PdfReader reader) throws IOException
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++)
        {
            merge(reader, parser, page);
        }
    }

    void merge(PdfReader reader, PdfReaderContentParser parser, int page) throws IOException
    {
        TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());
        Rectangle pageSizeToImport = reader.getPageSize(page);
        float heightToImport = finder.getHeight();
        float maxHeight = pageSize.getHeight() - topMargin - bottomMargin;
        if (heightToImport > maxHeight)
        {
            throw new IllegalArgumentException(String.format("Page %s content too large; height: %s, limit: %s.", page, heightToImport, maxHeight));
        }

        if (heightToImport > yPosition - pageSize.getBottom(bottomMargin))
        {
            newPage();
        }
        else if (!writer.isPageEmpty())
        {
            heightToImport += gap;
        }
        yPosition -= heightToImport;

        PdfImportedPage importedPage = writer.getImportedPage(reader, page);
        writer.getDirectContent().addTemplate(importedPage, 0, yPosition - (finder.getLly() - pageSizeToImport.getBottom()));
    }

    Document document = null;
    PdfWriter writer = null;
    float yPosition = 0; 

    final Rectangle pageSize;
    final float topMargin;
    final float bottomMargin;
    final float gap;
}

If you have a list of PdfReader instances named inputs, you can merge them into an OutputStream named output like this:

PdfDenseMergeTool tool = new PdfDenseMergeTool(PageSize.A4, 18, 18, 5);
tool.merge(output, inputs);

This creates a merged document using an A4 page size, a top and bottom margin of 18/72" each and a gap between contents of different PDF pages of 5/72".
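
To make the effect of the margins and the gap concrete, here is a minimal sketch of the placement arithmetic alone. It mirrors the checks in the merge(PdfReader, PdfReaderContentParser, int) method above but does not use iText at all; the class DensePackingSketch, its Placement type, and the sample heights are hypothetical names and values chosen for illustration, and the page bottom is assumed to be at y = 0.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: reproduces the vertical bookkeeping of PdfDenseMergeTool
// without iText. All names here are hypothetical.
class DensePackingSketch {

    /** Result for one block: 1-based output page and y coordinate of its lower edge. */
    static final class Placement {
        final int page;
        final float lowerY;

        Placement(int page, float lowerY) {
            this.page = page;
            this.lowerY = lowerY;
        }
    }

    static List<Placement> place(float pageHeight, float topMargin, float bottomMargin,
                                 float gap, float[] contentHeights) {
        List<Placement> result = new ArrayList<>();
        float maxHeight = pageHeight - topMargin - bottomMargin;
        int page = 1;
        float yPosition = pageHeight - topMargin; // first block starts below the top margin
        boolean pageEmpty = true;

        for (float height : contentHeights) {
            if (height > maxHeight) {
                throw new IllegalArgumentException("Content too large: " + height);
            }
            float needed = height;
            if (needed > yPosition - bottomMargin) {
                // does not fit into the remaining space: continue on a fresh page
                page++;
                yPosition = pageHeight - topMargin;
            } else if (!pageEmpty) {
                // fits below earlier content: keep the gap between the two blocks
                needed += gap;
            }
            yPosition -= needed;
            pageEmpty = false;
            result.add(new Placement(page, yPosition));
        }
        return result;
    }

    public static void main(String[] args) {
        // A4 height is 842 points; margins 18, gap 5, as in the usage example above
        float[] heights = {100f, 200f, 600f, 50f};
        for (Placement p : place(842f, 18f, 18f, 5f, heights)) {
            System.out.println("page " + p.page + ", lower edge at y = " + p.lowerY);
        }
    }
}
```

With these sample heights the first two blocks share page 1, and the third block no longer fits into the remaining space, so it opens page 2 and the fourth block follows it there.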

The comments

The iText TextMarginFinder (used in the PdfDenseMergeTool above) only considers text. If other content types also are to be considered, this class has to be extended somewhat.

Each PDF has just a few lines, perhaps a table or an image, but I want the end result in one page.

If the tables contain decorations reaching above or below the text content (e.g. lines or colored backgrounds), you should use a larger gap value. Unfortunately the parsing framework used by the TextMarginFinder does not forward vector graphics commands to the finder.

If the images are bitmap images, the TextMarginFinder should be extended by implementing its renderImage method to take the image area into account, too.
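
The geometry behind such an extension can be sketched without iText. In PDF, a bitmap image is painted into the unit square, which the current transformation matrix [a b c d e f] maps onto the page, so the image area is the bounding box of the four transformed corners. The class ContentBoundsSketch below and its method names are hypothetical; in a real extension the matrix values would presumably come from the ImageRenderInfo passed to renderImage, which is an assumption not shown here.

```java
// Illustration only: accumulates a bounding box over text rectangles and image
// areas. Names are hypothetical and nothing here depends on iText.
class ContentBoundsSketch {
    private boolean empty = true;
    private float llx, lly, urx, ury;

    /** Grows the bounding box so that it contains the given point. */
    void addPoint(float x, float y) {
        if (empty) {
            llx = urx = x;
            lly = ury = y;
            empty = false;
        } else {
            llx = Math.min(llx, x);
            lly = Math.min(lly, y);
            urx = Math.max(urx, x);
            ury = Math.max(ury, y);
        }
    }

    /** Adds a rectangle given by two opposite corners, e.g. a text chunk area. */
    void addRectangle(float x1, float y1, float x2, float y2) {
        addPoint(x1, y1);
        addPoint(x2, y2);
    }

    /** Adds the area of an image drawn with the matrix [a b c d e f]. */
    void addImage(float a, float b, float c, float d, float e, float f) {
        float[][] unitSquare = {{0, 0}, {1, 0}, {0, 1}, {1, 1}};
        for (float[] corner : unitSquare) {
            float x = corner[0];
            float y = corner[1];
            // PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f
            addPoint(a * x + c * y + e, b * x + d * y + f);
        }
    }

    float getLlx() { return llx; }
    float getLly() { return lly; }
    float getWidth() { return urx - llx; }
    float getHeight() { return ury - lly; }

    public static void main(String[] args) {
        ContentBoundsSketch bounds = new ContentBoundsSketch();
        bounds.addRectangle(72f, 700f, 272f, 750f);       // some text near the top
        bounds.addImage(150f, 0f, 0f, 100f, 300f, 580f);  // 150 x 100 image below it
        System.out.println("lly = " + bounds.getLly() + ", height = " + bounds.getHeight());
    }
}
```

Because all four corners are transformed, the sketch also covers rotated or skewed images, where the lower left corner of the unit square is not necessarily the lowest point on the page.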

Also, some of the PDFs may contain fields, so I'd like to keep those fields in the resulting combined PDF as well.

If AcroForm fields are also to be considered, you have to

extend the rectangle represented by the TextMarginFinder to also include the visualization rectangles of the widget annotations, and
extend the PdfDenseMergeTool.merge(PdfReader, PdfReaderContentParser, int) method to also copy those widget annotations.
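
For the second step, a copied widget annotation has to move by the same amount as the page content it belongs to. The following sketch shows only that coordinate arithmetic, under the assumption that the page content is placed with the vertical offset computed in the merge method above and no horizontal offset. WidgetShiftSketch and its methods are hypothetical names; reading the widget rectangles from the source PDF and writing them to the target is not covered.

```java
import java.util.Arrays;

// Illustration only: moves a widget rectangle by the offset that the merge
// method applies to the imported page content. Names are hypothetical.
class WidgetShiftSketch {

    /** Same expression as the y argument of addTemplate in the merge method. */
    static float verticalOffset(float yPosition, float contentLly, float pageBottom) {
        return yPosition - (contentLly - pageBottom);
    }

    /**
     * Shifts a rectangle given as {x1, y1, x2, y2}. The corners are normalized
     * first, because such an array may list any two opposite corners.
     */
    static float[] shiftRect(float[] rect, float dx, float dy) {
        float llx = Math.min(rect[0], rect[2]);
        float lly = Math.min(rect[1], rect[3]);
        float urx = Math.max(rect[0], rect[2]);
        float ury = Math.max(rect[1], rect[3]);
        return new float[] {llx + dx, lly + dy, urx + dx, ury + dy};
    }

    public static void main(String[] args) {
        // content found with its lower edge at y = 650 on a page whose bottom is 0,
        // to be placed with its lower edge at y = 519 on the merged page
        float dy = verticalOffset(519f, 650f, 0f);
        float[] moved = shiftRect(new float[] {100f, 660f, 300f, 680f}, 0f, dy);
        System.out.println(Arrays.toString(moved));
    }
}
```

The sketch assumes source pages whose lower left corner is at the origin. For pages with a different origin, the offset that actually applies to the annotations should be checked against the placement of the imported content.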

Update

I wrote above

Unfortunately the parsing framework used by the TextMarginFinder does not forward vector graphics commands to the finder.

Meanwhile (in version 5.5.6) that parsing framework has been extended to also forward vector graphics commands.

If you replace the line

TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());

by

MarginFinder finder = parser.processContent(page, new MarginFinder());

using the MarginFinder class presented at the bottom of this answer, all content is considered, not merely text.

**This is SO close to what I NEED** My scenario 3 PDFs to merge as: HEADER (1/2 page), BODY (X pages) and FOOTER (2/3 page) This code works great IF the next page can fit onto the current partially filled page. Mine will never do that since my BODY file has at least one full page. It will ALWAYS start on its own page because of that. SEE NEXT COMMENT FOR CONTINUATION — Grandizer, Mar 12 '15 at 12:55
What I need is to just remove the white space between the HEADER and where the BODY starts. Then if there is room at the end of the BODY, put the FOOTER after it (even if it gets chopped into two pages.) Is there a way to really just grab all of the content of all of the pages (assume the content has no white space) and then just spit that out into a new PDF and let it break the pages as needed? — Grandizer, Mar 12 '15 at 12:55
*Is there a way to really just grab all of the content of all of the pages (assume the content has no white space) and then just spit that out into a new PDF and let it break the pages as needed?* - Well, to *let it break the pages as needed* is a problem: the content of each page is drawn on its very own canvas. So you don't tell PDF "up to here on this page, from here on on that page". It is not hopeless, though. If you can share some sample files, please make it a question in its own right referencing those samples. — mkl, Mar 12 '15 at 13:28
Ah, I see, you already have another question. Can you add representative sample files to it? — mkl, Mar 12 '15 at 13:45
Okay, @mkl I have added sample Header, Body and Footer files to [My post](http://stackoverflow.com/questions/28991291/itextsharp-how-to-remove-whitespace-on-merge) — Grandizer, Mar 12 '15 at 16:50

score 2 · Answer 2 · answered Mar 12 '15 at 13:02

For those of you who want the above code in C#, here you go.

using System;
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace Test.WebService.Support {

  public class PDFMerge {

    private Rectangle PageSize;
    private float TopMargin;
    private float BottomMargin;
    private float Gap;
    private Document Document = null;
    private PdfWriter Writer = null;
    private float YPosition = 0;

    public PDFMerge(Rectangle size, float top, float bottom, float gap) {
      this.PageSize = size;
      this.TopMargin = top;
      this.BottomMargin = bottom;
      this.Gap = gap;
    } // PDFMerge

    public void Merge(MemoryStream outputStream, List<PdfReader> inputs) {
      try {
        this.OpenDocument(outputStream);

        foreach (PdfReader reader in inputs) {
          this.Merge(reader);
        }
      } finally {
        this.CloseDocument();
      }
    } // Merge

    private void Merge(PdfReader reader) {
      PdfReaderContentParser parser = new PdfReaderContentParser(reader);

      for (int p = 1; p <= reader.NumberOfPages; p++) {
        this.Merge(reader, parser, p);
      }
    } // Merge

    private void Merge(PdfReader reader, PdfReaderContentParser parser, int pageIndex) {
      TextMarginFinder Finder = parser.ProcessContent(pageIndex, new TextMarginFinder());
      Rectangle PageSizeToImport = reader.GetPageSize(pageIndex);
      float HeightToImport = Finder.GetHeight();
      float MaxHeight = PageSize.Height - TopMargin - BottomMargin;

      if (HeightToImport > MaxHeight) {
        throw new ArgumentException(string.Format("Page {0} content too large; height: {1}, limit: {2}.", pageIndex, HeightToImport, MaxHeight));
      }

      if (HeightToImport > YPosition - PageSize.GetBottom(BottomMargin)) {
        this.NewPage();
      } else if (!Writer.PageEmpty) {
        HeightToImport += Gap;
      }

      YPosition -= HeightToImport;

      PdfImportedPage ImportedPage = Writer.GetImportedPage(reader, pageIndex);
      Writer.DirectContent.AddTemplate(ImportedPage, 0, YPosition - (Finder.GetLly() - PageSizeToImport.Bottom));
    } // Merge

    private void OpenDocument(MemoryStream outputStream) {
      Document Document = new Document(PageSize, 36, 36, this.TopMargin, BottomMargin);
      PdfWriter Writer = PdfWriter.GetInstance(Document, outputStream);
      Document.Open();
      this.Document = Document;
      this.Writer = Writer;
      this.NewPage();
    } // OpenDocument

    private void CloseDocument() {
      try {
        // OpenDocument may have failed before the field was assigned
        if (this.Document != null) {
          this.Document.Close();
        }
      } finally {
        this.Document = null;
        this.Writer = null;
        this.YPosition = 0;
      }
    } // CloseDocument

    private void NewPage() {
      Document.NewPage();
      YPosition = PageSize.GetTop(TopMargin);
    } // NewPage

  }
}

The completed answer with both Java and C# code can be found [here](http://stackoverflow.com/questions/28991291/how-to-remove-whitespace-on-merge) — Grandizer, Mar 18 '15 at 13:10
