Unable to merge 2 PDFs using MemoryStream

Question

I have a C# class that takes an HTML string and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs: landscape, portrait, and a combination of the two.

The properties object contains the html as a string, and the argument for landscape/portrait.

System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;

properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;

System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);

try
{
    PDF.WriteTo(file);
    PDF.Flush();

    PDF_portrait.WriteTo(file_portrait);
    PDF_portrait.Flush();

    finalStream.WriteTo(file_combined);
    finalStream.Flush();
}
catch (Exception)
{
    throw;
}
finally
{
    PDF.Close();
    file.Close();

    PDF_portrait.Close();
    file_portrait.Close();

    finalStream.Close();
    file_combined.Close();
}

The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).

I am using MemoryStream to perform the merge, and while debugging I can see that finalStream.Length is equal to the sum of the lengths of the previous two PDFs. But when I try to open the PDF, I see the content of just one of the two PDFs.
The same can be seen below:

Additionally, when I try to close the "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.

Below are a few things that I have tried out already, to no avail:

Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself

In case it is required, below is the elaboration of the GetPdfStream() method.

var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;

Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
    process.Start();
    process.BeginErrorReadLine();

    var inputTask = Task.Run(() =>
    {
        htmlStream.CopyTo(process.StandardInput.BaseStream);
        process.StandardInput.Close();
    });

    // Copy the output to a memorystream
    MemoryStream pdf = new MemoryStream();
    var outputTask = Task.Run(() =>
    {
        process.StandardOutput.BaseStream.CopyTo(pdf);
    });

    Task.WaitAll(inputTask, outputTask);

    process.WaitForExit();

    // Reset memorystream read position
    pdf.Position = 0;

    return pdf;
}
catch (Exception ex)
{
    throw ex;
}
finally
{
    process.Dispose();
}

PDF is a structured file format, which means it consists of many small parts that together build a full document. See section 7.5 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf This is the format that PDF readers parse as well. They expect to find 'Header, Body, Cross-reference table, Trailer' in one file, but instead they find 'Header, Body, Cross-reference table, Trailer, Header, Body, Cross-reference table, Trailer'. You'll need a library that understands this format (easiest), or you can write one yourself (the specification is in the document mentioned above). — Caramiriel, Aug 26 '19 at 06:56
@Caramiriel This makes a lot of sense. Could you please make this an answer. I would like to mark this as solved — Sanketh. K. Jain, Aug 26 '19 at 07:08
Duplicate: https://stackoverflow.com/q/808670/2441442 (Can not be closed while on bounty) — Christian Gollhardt, Aug 30 '19 at 19:41
@ChristianGollhardt While the aforementioned question has been answered with the implementation to the problem, it doesn't tell me why I should use a library. The answer that I was looking for was either an explanation as provided by Matthew and Caramiriel, or a code solution without a library (which I now realise is an unreasonable expectation). Request you to reconsider. Thanks. — Sanketh. K. Jain, Sep 02 '19 at 00:59

Maytham Fahmi · Answer 1 · 2019-09-01T20:23:46.443

Merging PDFs in C#, or any other language, is not straightforward without a third-party library.

I assume you want to avoid a library because most free libraries and NuGet packages have limitations or cost money for commercial use.

I did some research and found an open-source library called PdfClown, available as a NuGet package and also for Java. It is free and without limitations (donate if you like). Among its many features, it can merge two or more documents into one.

My example takes a folder containing multiple PDF files, merges them, and saves the result to the same or another folder. A MemoryStream could be used as well, but I did not find it necessary in this case.

The code is self-explanatory; the key point is the use of SerializationModeEnum.Incremental:

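// Requires: using System.IO; using System.Linq;
// PdfClown types are fully qualified so that File still resolves to System.IO.File.
public static void MergePdf(string srcPath, string destFile)
{
    // Validate the arguments before touching the file system.
    if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile))
        return;
    var list = Directory.GetFiles(Path.GetFullPath(srcPath));
    // Merging needs at least two input files.
    if (list.Length <= 1)
        return;
    var files = list.Select(File.ReadAllBytes).ToList();
    // The first file becomes the destination document; the others are appended to it.
    using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
    {
        var document = dest.Document;
        var builder = new org.pdfclown.tools.PageManager(document);
        foreach (var file in files.Skip(1))
        {
            using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
            { builder.Add(src.Document); }
        }

        dest.Save(destFile, org.pdfclown.files.SerializationModeEnum.Incremental);
    }
}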

To test it

var srcPath = @"C:\temp\pdf\input";
var destFile = @"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);

Disclaimer: Part of this answer is taken from my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/ with the source code on GitHub.

Thank you for taking the time to write this down. Choosing not to use a library was a decision given to me by my seniors. Your answer helps me provide part of the explanation to use a library, but it is Matthew's answer and Caramiriel's comment (on the question) that helped me make a strong case in favor of a library, and hence I've marked the answer — Sanketh. K. Jain, Sep 02 '19 at 00:46
@Sanketh.K.Jain Sure, that is fine; the linked comment and Matthew's answer are useful indeed. But I was not aware that you were looking for the theoretical part of it. — Maytham Fahmi, Sep 02 '19 at 03:51

score 7 · Answer 2 · edited Aug 30 '19 at 19:59

This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:

        using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
        using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
        using (PdfDocument outPdf = new PdfDocument())
        {
            CopyPages(one, outPdf);
            CopyPages(two, outPdf);

            outPdf.Save("file1and2.pdf");
        }

        void CopyPages(PdfDocument from, PdfDocument to)
        {
            for (int i = 0; i < from.PageCount; i++)
            {
                to.AddPage(from.Pages[i]);
            }
        }

Alexander Bruun · answered Aug 26 '19 at 06:41 · edited by Christian Gollhardt Aug 30 '19 at 19:59

I'm looking for something without PdfSharp – Sanketh. K. Jain Aug 26 '19 at 06:43
@Sanketh.K.Jain MemoryStream exclusively or are other technologies allowed? (https://stackoverflow.com/a/32225966/6925434) – Alexander Bruun Aug 26 '19 at 06:45
Just C# exclusively. No other technologies. I have my PDFs in a stream as of now, which were generated as an output of the wkhtmltopdf. – Sanketh. K. Jain Aug 26 '19 at 06:47
I don't see why you can't use another nuget package when you're already using wkhtmltopdf, but that is just my opinion. – Alexander Bruun Aug 26 '19 at 06:52
I understand. But that's the requirement I've been handed :P – Sanketh. K. Jain Aug 26 '19 at 06:56
We usually don't handle duplicates by reposting them. And even when we do, [attribution is required](https://stackoverflow.blog/2009/06/25/attribution-required/). – Christian Gollhardt Aug 30 '19 at 19:48

score 4 · Answer 3 · edited Aug 30 '19 at 19:53

That's not quite how PDFs work. PDFs are structured files in a specific format. You can't just append the bytes of one to the other and expect the result to be a valid document.

You're going to have to use a library that understands the format and can do the operation for you, or develop your own solution.
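
To make this concrete, here is a minimal sketch (not part of the original answer) that counts PDF markers in a byte buffer. The names PdfMarkerCounter, Count, and Concat are hypothetical, and Concat only mimics the two CopyTo calls from the question. It uses nothing beyond the base class library.

```csharp
using System;
using System.Text;

public static class PdfMarkerCounter
{
    // Counts non-overlapping occurrences of an ASCII marker in a byte buffer.
    public static int Count(byte[] data, string marker)
    {
        byte[] needle = Encoding.ASCII.GetBytes(marker);
        if (needle.Length == 0) return 0;

        int count = 0;
        int i = 0;
        while (i <= data.Length - needle.Length)
        {
            int j = 0;
            while (j < needle.Length && data[i + j] == needle[j]) j++;

            if (j == needle.Length)
            {
                count++;
                i += needle.Length; // skip past the match
            }
            else
            {
                i++;
            }
        }
        return count;
    }

    // Appends the second buffer to the first, byte for byte,
    // which is what copying two streams into one MemoryStream does.
    public static byte[] Concat(byte[] first, byte[] second)
    {
        byte[] combined = new byte[first.Length + second.Length];
        Buffer.BlockCopy(first, 0, combined, 0, first.Length);
        Buffer.BlockCopy(second, 0, combined, first.Length, second.Length);
        return combined;
    }
}
```

Run on the concatenation of two PDFs that each contain a single "%PDF-" header and a single "%%EOF" marker, Count should report two of each, while the combined length equals the sum of the parts: two complete documents back to back rather than one merged document. A PDF with incremental updates can legitimately contain several "%%EOF" markers, so treat this as a diagnostic aid, not a validity check.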

IOrlandoni · answered Aug 30 '19 at 19:37 · edited by Christian Gollhardt Aug 30 '19 at 19:53

score 3 · Accepted Answer · answered Sep 01 '19 at 23:34

PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like PDF version, the objects contained in the file and where to find them.

In order to merge 2 PDFs you'll need to manipulate the streams.

First you'll need to keep the header from only one of the files. This is pretty easy since it's just the first line.

Then you can write the body of the first page, and then the second.

Now the hard part, and likely the part that will convince you to use a library, is that you have to rebuild the xref table. The xref table is a cross-reference table that describes the content of the document and, more importantly, where to find each element. You'd have to calculate the byte offset of the second page, shift all of the elements in its xref table by that much, and then add its xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.

Once that's done, you need to re-build the document trailer which tells an application where the various sections of the document are among other things.

See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/

This is not trivial and you'll end up re-writing lots of code that already exists.
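
As a rough illustration of the offset problem only, the sketch below reads the number that follows the last startxref keyword, which the PDF specification defines as the byte offset of the cross-reference section counted from the start of the file. The names StartXrefReader, ReadLast, and OffsetAfterAppend are hypothetical, and this covers one small piece of a merge, not the whole job.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StartXrefReader
{
    // Returns the number following the last "startxref" keyword, or -1 if none is found.
    public static long ReadLast(byte[] pdf)
    {
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);

        const string keyword = "startxref";
        int at = text.LastIndexOf(keyword, StringComparison.Ordinal);
        if (at < 0) return -1;

        // Skip the whitespace between the keyword and the number.
        int i = at + keyword.Length;
        while (i < text.Length && char.IsWhiteSpace(text[i])) i++;

        // Collect the decimal digits of the offset.
        int start = i;
        while (i < text.Length && text[i] >= '0' && text[i] <= '9') i++;
        if (i == start) return -1;

        return long.Parse(text.Substring(start, i - start), CultureInfo.InvariantCulture);
    }

    // Where the second file's cross-reference section ends up once the second
    // file has been appended to the first: its old offset plus the first file's length.
    public static long OffsetAfterAppend(byte[] first, byte[] second)
    {
        long original = ReadLast(second);
        return original < 0 ? -1 : first.Length + original;
    }
}
```

In a plain concatenation the stored number is not updated, so a reader that follows the last startxref seeks from the start of the combined file and lands in the first document's bytes instead of on the second document's cross-reference table. The per-object offsets inside that table need the same shift, and since both files typically number their objects from 1, the objects would also have to be renumbered.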

Really appreciate this insight. And this has helped me make a case to my seniors for using a pdf library. — Sanketh. K. Jain, Sep 02 '19 at 00:51
