I have a SQL Server database with many, many rows. Each row has a column that contains a stored PDF.
The database is about a gigabyte in size, so we can expect roughly half of that to be the PDFs.
Now I have a requirement to join all those PDFs into one PDF. Don't ask why. Can you suggest the best way forward, and which component would be best suited for this job? There are many answers available:

How can I join two PDF's using iTextSharp?
Merge memorystreams to one itext document
How to merge multiple pdf files (generated in run time)?

as to how to join two (or more) PDFs. But what I'm asking about is performance: we are literally dealing with around 50,000 PDFs that need to be merged into one almighty PDF.

[Edit: Solution] This brought the time to merge 1,000 PDFs from 4m30s down to 21s:

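// Requires: using System; using System.IO; using System.Windows.Forms;
//           using iTextSharp.text; using iTextSharp.text.pdf;
public void MergePDFs(string targetPDF, string sourceDir)
{
    using (FileStream stream = new FileStream(targetPDF, FileMode.Create))
    {
        var files = Directory.GetFiles(sourceDir);

        Document pdfDoc = new Document(PageSize.A4);
        PdfCopy pdf = new PdfCopy(pdfDoc, stream);
        pdfDoc.Open();

        Console.WriteLine("Merging files count: " + files.Length);
        int i = 1;
        var watch = System.Diagnostics.Stopwatch.StartNew();
        foreach (string file in files)
        {
            Console.WriteLine(i + ". Adding: " + file);
            PdfReader reader = new PdfReader(file);
            try
            {
                // Copy every page of this source file into the target.
                pdf.AddDocument(reader);
            }
            finally
            {
                // Release each source as soon as its pages have been copied.
                reader.Close();
            }
            i++;
        }

        // Finalizes the merged PDF (throws if no pages were added).
        pdfDoc.Close();

        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        MessageBox.Show(elapsedMs.ToString());
    }
}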
Eminem
  • Size is not a problem, I've tested the creation of PDFs with 10G and it works. The problem is that 50000 PDFs will take a long time to merge and consume a lot of memory, that's a consequence of the PDF format and it will be bad whatever you use. You may consider the use of collections, not a merge but rather a PDF containing the other PDFs that can be selected. – Paulo Soares Aug 11 '16 at 22:58
  • What do you mean by collections? – Eminem Aug 12 '16 at 03:38
  • See [http://developers.itextpdf.com/examples/miscellaneous/clone-portable-collections](http://developers.itextpdf.com/examples/miscellaneous/clone-portable-collections). – Paulo Soares Aug 12 '16 at 16:00

1 Answer

I just did a C#/WinForms project with PDFsharp, merging images into PDFs, and it worked phenomenally with a traditional folder structure. I imagine it would work similarly with PDFs stored in a database, so long as you pull each one into a memory stream first and then merge them.

Some suggestions:

1) Do it in a multi-threaded environment so you can work on multiple PDFs at a time.
2) Open only what you need and close it as soon as the operation is complete. Say you have three documents that need to be merged into one. Create a blank PDF. Open the first document into a memory stream and open the blank one. Append the first to the blank. Close the first, save the blank, close the blank. Repeat for the second and third. This way you control how much memory you are using at any point in time. With this approach I was able to append millions of images while keeping memory usage under control.
3) Make sure you wrap disposable objects in using statements. This helps with memory cleanup and removes the need to call the garbage collector manually, which is frowned upon.
4) Separate your business logic (the work) from your UI as best you can, so you can cancel the operation at any point or view its status as it progresses.
5) Log everything that is done so that you can go back and fix the one-off PDFs that didn't make it through the first pass.
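To make suggestions 2, 3 and 5 concrete, here is a minimal sketch using iTextSharp 5 (the library from the question) rather than PDFsharp. It is only an illustration under assumptions: the table name `Documents` and its columns `Id` and `PdfData` are hypothetical, as are the class and method names, and it has not been measured against 50,000 files. Each PDF is read from the database as a byte array, copied into the target, and released before the next one is loaded.

```csharp
using System;
using System.Data.SqlClient;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class PdfMerger
{
    // Hypothetical schema: Documents(Id INT, PdfData VARBINARY(MAX)).
    // Returns the number of source PDFs that were merged.
    public static int MergeFromDatabase(string connectionString, string targetPdf)
    {
        int merged = 0;
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT Id, PdfData FROM Documents ORDER BY Id", connection))
        using (var output = new FileStream(targetPdf, FileMode.Create))
        {
            connection.Open();
            var document = new Document();
            var copy = new PdfCopy(document, output);
            document.Open();

            using (var rows = command.ExecuteReader())
            {
                while (rows.Read())
                {
                    int id = rows.GetInt32(0);
                    PdfReader reader = null;
                    try
                    {
                        // Only one source PDF is held in memory at a time.
                        byte[] bytes = (byte[])rows[1];
                        reader = new PdfReader(bytes);
                        copy.AddDocument(reader);
                        merged++;
                    }
                    catch (Exception ex)
                    {
                        // Log failures so they can be fixed in a second pass.
                        Console.Error.WriteLine("Skipped row " + id + ": " + ex.Message);
                    }
                    finally
                    {
                        if (reader != null) reader.Close();
                    }
                }
            }

            // Closing a document with no pages throws, so only finalize
            // when at least one source was merged.
            if (merged > 0) document.Close();
        }
        return merged;
    }
}
```

A call might look like `PdfMerger.MergeFromDatabase(connectionString, @"C:\temp\merged.pdf")`. A row that fails partway through `AddDocument` may leave some of its pages in the output, so the log is worth checking before trusting the result.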

dkolln
  • MemoryStreams blazed a trail of glory. Brought my time to merge 1000 pdfs from 4m30s to 21s! – Eminem Aug 13 '16 at 08:04