
I am using iTextSharp with C# in Visual Studio 2010, and I've recently run into the following situation. I received several ebooks split into numerous PDF files. These files contained galley marks in the borders, which I removed using the following code:

    // Size of source page i; the new page is 52 points smaller in each dimension.
    x = reader.GetPageSize(i).Width;
    y = reader.GetPageSize(i).Height;
    iTextSharp.text.Rectangle tRect = 
      new iTextSharp.text.Rectangle(x - 52, y - 52);
    Document document = new Document(tRect);
    // FileMode.Create truncates an existing file; OpenOrCreate can leave stale bytes behind.
    PdfWriter writer = PdfWriter.GetInstance(document, 
      new FileStream(dest, FileMode.Create));

    document.Open();
    PdfContentByte content = writer.DirectContent;
    PdfImportedPage page = writer.GetImportedPage(reader, i);

    // Shift the imported page so the galley marks fall outside the smaller page.
    content.AddTemplate(page, -offset, -offset);

    document.Close();
    reader.Close();

Of course, this is enclosed in a for loop with i as the page number. After I've iterated through each of the pages in the portion I'm working on, I use the following code to merge them together:

    private void mergePDF(string fName, string folderPath)
    {
        string[] files = Directory.GetFiles(folderPath);
        iTextSharp.text.Document tDoc = new iTextSharp.text.Document();
        iTextSharp.text.pdf.PdfCopy copy = 
          new iTextSharp.text.pdf.PdfCopy(tDoc, 
            new FileStream(fName, FileMode.Create));
        tDoc.Open();
        iTextSharp.text.pdf.PdfReader reader;
        int n = 0;
        for (int i = 0; i < files.Length; i++)
        {
            reader = new iTextSharp.text.pdf.PdfReader(files[i]);
            n = reader.NumberOfPages;
            for (int page = 0; page < n; )
            {
                copy.AddPage(copy.GetImportedPage(reader, ++page));
            }
            copy.FreeReader(reader);
            reader.Close();
        }
        tDoc.Close();
    }        

After this is complete, I find that my file size is doubled (one file in particular weighed in at 20,180KB before processing and 41,322KB after processing)!

I did some digging, and it seems that when splitting PDFs with iTextSharp, the program embeds all of the fonts for the complete PDF in each PDF that is split off. Apparently this can account for 50-80% of the file size.

That being said, does anyone know of a way to remove the embedded fonts from a PDF using iTextSharp? My plan is to include them only in the first PDF file, so that when the PDF is recompiled there will be only one copy of the fonts in the document and the file size will be more reasonable.

Also of note, this code is a close approximation of my actual code - the logic is identical but some variables have been added for size and flow considerations.

Thomas Hawkins
    Okay, I found my own answer. Instead of using PdfCopy I used PdfSmartCopy, and in the case mentioned above the output went from 41,322KB to 16,854KB. The original was 20,180KB, so the result is about 3.3MB smaller than the original! – Thomas Hawkins Jan 31 '13 at 17:46
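
For reference, a minimal sketch of that change, assuming iTextSharp 5.x. The class name `PdfMerge` and the method name `MergeWithSmartCopy` are made up for illustration; the only substantive difference from the merge routine in the question is that `PdfSmartCopy` replaces `PdfCopy`, so identical streams (such as font programs repeated in every split file) should be written once rather than once per input file.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

static class PdfMerge   // hypothetical helper class, not part of iTextSharp
{
    // Merges every PDF found in folderPath into a single file at outputPath.
    public static void MergeWithSmartCopy(string outputPath, string folderPath)
    {
        // Assumption: outputPath does not already exist inside folderPath,
        // otherwise a previous result would be picked up as an input.
        string[] files = Directory.GetFiles(folderPath, "*.pdf");

        using (FileStream output = new FileStream(outputPath, FileMode.Create))
        {
            Document document = new Document();
            // PdfSmartCopy detects identical streams and reuses them.
            PdfSmartCopy copy = new PdfSmartCopy(document, output);
            document.Open();

            foreach (string file in files)
            {
                PdfReader reader = new PdfReader(file);
                // iText page numbers start at 1.
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    copy.AddPage(copy.GetImportedPage(reader, page));
                }
                copy.FreeReader(reader);
                reader.Close();
            }

            // Closing the document finishes the PDF; it throws if no page was added.
            document.Close();
        }
    }
}
```

The trade-off is that comparing streams costs extra time and memory during the merge, so this is likely worth it mainly when the inputs share a lot of content, as the split files here do.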
    While using PdfSmartCopy instead of PdfCopy is the correct way to remove duplicate streams, your code for removing the galley marks is not very efficient. You could achieve the same effect in a single pass by using a PdfStamper to manipulate the media box entries of the document pages instead of extracting and inserting whole page contents. Cf. [iText - how to move down the current contents in a pdf](http://stackoverflow.com/questions/12798583/itext-how-to-move-down-the-current-contents-in-a-pdf/12813721). Or you may simply want to create appropriate crop box entries... – mkl Feb 01 '13 at 07:46
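
A rough sketch of the crop box variant of this suggestion, assuming iTextSharp 5.x. The names `GalleyTrim`, `CropMargins`, and `margin` are hypothetical, and a margin of 26 points per side is only a guess derived from the 52 points the question subtracts from each dimension. Note that a crop box hides the galley marks in viewers without deleting them from the page content.

```csharp
using System.IO;
using iTextSharp.text.pdf;

static class GalleyTrim   // hypothetical helper class, not part of iTextSharp
{
    // Writes a copy of src to dest in which every page has a crop box
    // inset by `margin` points on all four sides of its media box.
    public static void CropMargins(string src, string dest, float margin)
    {
        PdfReader reader = new PdfReader(src);

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // GetPageSize returns the media box of page i.
            iTextSharp.text.Rectangle box = reader.GetPageSize(i);
            PdfDictionary page = reader.GetPageN(i);

            // Set the crop box directly in the page dictionary.
            page.Put(PdfName.CROPBOX, new PdfRectangle(
                box.Left + margin, box.Bottom + margin,
                box.Right - margin, box.Top - margin));
        }

        using (FileStream output = new FileStream(dest, FileMode.Create))
        {
            // The stamper writes out the modified reader; no page content is re-imported.
            PdfStamper stamper = new PdfStamper(reader, output);
            stamper.Close();
        }
        reader.Close();
    }
}
```

Something like `GalleyTrim.CropMargins("in.pdf", "out.pdf", 26f)` would process a whole file at once. Because the pages never leave the original document, there is no split-and-merge step and therefore no duplicated fonts to clean up afterwards.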
