I'm using iText 2.1.5 to merge two PDF files. My problem is that when the concatenated PDF is generated, all fonts used in both PDFs are duplicated. Is there a better way to handle this, so that the fonts only get embedded once?

Source code:

    import java.io.FileOutputStream;

    // iText 2.1.5 packages; in iText 5 these classes live under com.itextpdf.text
    import com.lowagie.text.Document;
    import com.lowagie.text.pdf.PdfReader;
    import com.lowagie.text.pdf.PdfSmartCopy;

    public class GroupingPDF {

        public static final String RESULT = "/home/asagaama/Documents/groupementpdf/get/concatenated.pdf";

        public static void main(String[] args) {
            try {
                String[] files = {
                        "/home/asagaama/Documents/groupementpdf/get/1.pdf",
                        "/home/asagaama/Documents/groupementpdf/get/2.pdf" };
                Document document = new Document();
                // PdfSmartCopy reuses objects that are identical across the copied pages
                PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document,
                        new FileOutputStream(RESULT));
                document.open();

                // loop over the documents you want to concatenate
                for (int i = 0; i < files.length; i++) {
                    PdfReader reader = new PdfReader(files[i]);
                    // loop over the pages in that document (page numbers start at 1)
                    int n = reader.getNumberOfPages();
                    for (int page = 1; page <= n; page++) {
                        pdfSmartCopy.addPage(pdfSmartCopy.getImportedPage(reader, page));
                    }
                }
                document.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

  • Are you talking about embedded fonts or not? If embedded, are you talking about subsets of fonts or full fonts? If subsets, no one can help you. If non-embedded fonts or fully embedded fonts, you are using the wrong way to concatenate your PDFs (and the wrong iText version for that matter, you should upgrade). – Bruno Lowagie Jan 29 '15 at 09:34
  • I'm talking about embedded subset fonts. Why can no one help me if they are subsets? –  Jan 29 '15 at 09:45
  • Because you aren't merely asking to remove duplicate fonts (which is a no-brainer if you're talking about fonts that aren't embedded or fonts that are fully embedded), you're asking to merge different subsets of the same font into one font object. This is quite complex and may require entire content streams to be rewritten. With iText, you can only reduce the file size by reusing font subsets that are **identical**. The moment they are different, you can't reuse different subsets unless you merge them (and merging subsets isn't supported). – Bruno Lowagie Jan 29 '15 at 10:00
  • Bruno, that's not my goal. I only want to remove duplicate embedded subset fonts. To sum up: for example, PDF 1 has one font (Arial MT TrueType (CID) Embedded subset) and PDF 2 has two fonts (Arial MT TrueType (CID) Embedded subset and Helvetica.ttf TrueType (CID) Embedded subset). After merging the PDFs, I get one PDF with a duplicated font (Arial MT TrueType (CID) Embedded subset), whereas I want a single Arial MT TrueType (CID) Embedded subset. I searched the internet and the suggested solution was to use PdfSmartCopy, but it does not work for me. –  Jan 29 '15 at 10:09
  • The answer is indeed `PdfSmartCopy` but why doesn't it work for you? (1) because the subsets are **different** (if one font has "a, b, c" and the other font has "d, e, f" the font objects are **different** and `PdfSmartCopy` won't help you), or (2) you are using a version of iText dating from March 2009, *deliberately* ignoring all improvements that have been added in the last 6 years. You should upgrade and test with the most recent version. – Bruno Lowagie Jan 29 '15 at 10:18
  • I tested with iText version 4.2.1, but I get the same problem :( How can I solve this, Bruno? I want the output to contain (Arial MT TrueType (CID) Embedded subset, Helvetica.ttf TrueType (CID) Embedded subset) and not (Arial MT TrueType (CID) Embedded subset, Helvetica.ttf TrueType (CID) Embedded subset, Arial MT TrueType (CID) Embedded subset). –  Jan 29 '15 at 10:28
  • There is no such thing as iText 4.2.1. If you found such a version, it was not distributed by iText Software. The most recent version is iText 5.5.4. Also: you are not being clear. I am going to vote to close the question because you keep on talking about embedded subsets. If you have embedded subsets, there may not be a solution for you. Why do you ignore that fact? You may have a reason (for instance, if you **know** that the subsets are identical), but you're not telling me. – Bruno Lowagie Jan 29 '15 at 10:33
  • I'll test with iText 5.5.4... –  Jan 29 '15 at 10:36
  • An embedded subset is always marked with a random code consisting of 6 letters and a plus sign (+). Read [What are the extra characters in the font name of my PDF?](http://stackoverflow.com/questions/16580270/). Tell me: when you say you have two instances of "Arial MT TrueType", is the 6-letter prefix identical? If not, you are out of luck: the subsets are different and you need software that can merge two different subsets. – Bruno Lowagie Jan 29 '15 at 10:37
  • I tested with itextpdf-5.5.1 and the problem persists. I am sure that the 6-letter prefix of "Arial MT TrueType" is identical in the two PDFs, because I created both PDFs myself using JasperReports with the same font, Arial MT. Bruno, can I post the code so you can see whether there is a problem? –  Jan 29 '15 at 10:56
  • Actually, if you post a question on StackOverflow, you are supposed to post a [SSCCE](http://sscce.org) that reproduces the problem. In other words: you need to add the code and the PDFs that can be used to reproduce the problem on StackOverflow. So by all means, update your question and add the code as well as links to the PDFs. – Bruno Lowagie Jan 29 '15 at 11:06

1 Answer

I have examined your file, and I have taken screen shots of the font resources used by each page:

Page 1: [screenshot of the font resources]

We see 5 fonts:

  1. EUDXLQ+FranklinGothic-Book
  2. XQNBQD+FranklinGothicLT-Book
  3. DDOZBL+Helvetica.ttf
  4. DYKMGD+FranklinGothicLT-Demi
  5. MZDMJV+ArialMT

Page 2: [screenshot of the font resources]

We see 4 fonts:

  1. KWKZVU+FranklinGothic-Book (a subset of FranklinGothic-Book that is different from the one on page 1)
  2. WUQFPY+FranklinGothicLT-Book (a subset of FranklinGothicLT that is different from the one on page 1)
  3. SQWYVD+FranklinGothicLT-Demi (a subset of FranklinGothicLT-Demi that is different from the one on page 1)
  4. ZKEBIA+ArialMT (a subset of ArialMT that is different from the one on page 1)

Page 3: [screenshot of the font resources]

This page has 2 fonts:

  1. KWKZVU+FranklinGothic-Book (The same subset as on page 2, it is also the same object: object 34 0)
  2. ZKEBIA+ArialMT (The same subset as on page 2, it is also the same object: object 44 0)

Page 4: [screenshot of the font resources]

This page has 2 fonts:

  1. KWKZVU+FranklinGothic-Book (The same subset as on page 2, it is also the same object: object 34 0)
  2. ZKEBIA+ArialMT (The same subset as on page 2, it is also the same object: object 44 0)

If this is the result of using `PdfSmartCopy`, then iText has done its job well: identical font subsets are stored in the same object, so there is no redundant font.

Unfortunately, ArialMT and some of the FranklinGothic fonts cannot be reused because their subsets are different. iText isn't able to merge different subsets of the same font.
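The comparison above can also be done mechanically. The following is a minimal sketch in plain Java (no iText calls) that takes the font names listed for pages 1 and 2, strips the 6-letter subset tag, and groups the names by base font. The class and method names (`SubsetPrefixCheck`, `baseName`, `groupByBaseName`, `exampleFonts`) are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; it only inspects font name strings.
final class SubsetPrefixCheck {

    // A subset font name looks like "ABCDEF+BaseName": six uppercase letters, then '+'.
    static boolean isSubset(String fontName) {
        return fontName.matches("[A-Z]{6}\\+.+");
    }

    // Strips the 7-character subset tag ("ABCDEF+") if present.
    static String baseName(String fontName) {
        return isSubset(fontName) ? fontName.substring(7) : fontName;
    }

    // Groups the distinct full font names by their base font name.
    static Map<String, Set<String>> groupByBaseName(List<String> fontNames) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String name : fontNames) {
            groups.computeIfAbsent(baseName(name), k -> new LinkedHashSet<>()).add(name);
        }
        return groups;
    }

    // Font names copied from the page 1 and page 2 listings in this answer.
    static List<String> exampleFonts() {
        return Arrays.asList(
                "EUDXLQ+FranklinGothic-Book", "XQNBQD+FranklinGothicLT-Book",
                "DDOZBL+Helvetica.ttf", "DYKMGD+FranklinGothicLT-Demi", "MZDMJV+ArialMT",
                "KWKZVU+FranklinGothic-Book", "WUQFPY+FranklinGothicLT-Book",
                "SQWYVD+FranklinGothicLT-Demi", "ZKEBIA+ArialMT");
    }

    public static void main(String[] args) {
        groupByBaseName(exampleFonts()).forEach((base, names) ->
                System.out.println(base + ": " + names.size() + " subset name(s) " + names));
    }
}
```

For these names, every base font except Helvetica.ttf appears under two different tags, which matches the observation that those subsets cannot be shared.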

I had already explained this in the comments, but you then made claims that turned out not to be true. Only after you shared the document was I able to show that your question was based on false assumptions.

Update:

What are your options if you want to concatenate PDFs and reduce the number of fonts?

If you don't embed the font, iText will detect identical font dictionaries and remove the redundant ones. The same is true if you embed the full font (that is, if you don't allow the PDF producer to create a subset). However, embedding the full font isn't always an option: depending on the font, it can result in much larger files.
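To illustrate why identical font dictionaries can be collapsed while differing subsets cannot, here is a toy model in plain Java. It is a sketch of the general idea (reuse an object when its serialized form has been seen before), not iText's actual implementation; the class `FontObjectPool` and the dictionary strings are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: objects are plain strings, and "identical" means equal strings.
final class FontObjectPool {
    private final Map<String, Integer> numberByContent = new HashMap<>();
    private final List<String> objects = new ArrayList<>();

    // Returns the object number for this content, reusing an earlier object if it is identical.
    int add(String serialized) {
        Integer existing = numberByContent.get(serialized);
        if (existing != null) {
            return existing;
        }
        objects.add(serialized);
        int number = objects.size(); // object numbers start at 1 in this model
        numberByContent.put(serialized, number);
        return number;
    }

    // Number of distinct objects actually stored.
    int size() {
        return objects.size();
    }

    public static void main(String[] args) {
        FontObjectPool pool = new FontObjectPool();
        // Non-embedded font: both documents carry the same dictionary, so it is stored once.
        int first = pool.add("<< /Type /Font /BaseFont /ArialMT >>");
        int second = pool.add("<< /Type /Font /BaseFont /ArialMT >>");
        // Different subsets: the dictionaries differ, so both are kept.
        int subsetA = pool.add("<< /Type /Font /BaseFont /MZDMJV+ArialMT >>");
        int subsetB = pool.add("<< /Type /Font /BaseFont /ZKEBIA+ArialMT >>");
        System.out.println(first == second);   // same object reused
        System.out.println(subsetA == subsetB); // distinct objects
        System.out.println(pool.size());
    }
}
```

In this model the two non-embedded dictionaries share one object, while the two subset dictionaries stay separate because their content differs.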

We have font improvements on our technical road map, but I don't think merging font subsets into a single font is part of that sub-project. In some cases, it's probably feasible to implement this in iText, e.g. in cases where a predictable encoding is used. In other cases, merging different subsets will be nearly impossible because it would require rewriting entire content streams.
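The reason content streams may need rewriting can be shown with a deliberately simplified model, using the "a, b, c" versus "d, e, f" example from the comments. The sketch assumes each subset numbers its glyphs independently from 1, so the same code means different glyphs in different subsets; real PDF fonts are more involved, and all names here (`SubsetMergeSketch`, `mergeInto`, `rewrite`, `decode`) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: a subset is a code-to-glyph table, a content stream is an array of codes.
final class SubsetMergeSketch {

    // Decodes a content stream using a subset's code-to-glyph table.
    static String decode(Map<Integer, Character> subset, int[] codes) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            text.append(subset.get(code));
        }
        return text.toString();
    }

    // Adds the glyphs of 'other' to 'merged'; returns the old-code to new-code mapping for 'other'.
    static Map<Integer, Integer> mergeInto(Map<Integer, Character> merged,
                                           Map<Integer, Character> other) {
        Map<Integer, Integer> remap = new LinkedHashMap<>();
        for (Map.Entry<Integer, Character> entry : other.entrySet()) {
            Integer newCode = null;
            for (Map.Entry<Integer, Character> known : merged.entrySet()) {
                if (known.getValue().equals(entry.getValue())) {
                    newCode = known.getKey(); // glyph already present, reuse its code
                    break;
                }
            }
            if (newCode == null) {
                newCode = merged.size() + 1; // assumes codes are contiguous from 1
                merged.put(newCode, entry.getValue());
            }
            remap.put(entry.getKey(), newCode);
        }
        return remap;
    }

    // Rewrites a content stream so that it refers to the merged table.
    static int[] rewrite(int[] codes, Map<Integer, Integer> remap) {
        int[] out = new int[codes.length];
        for (int i = 0; i < codes.length; i++) {
            out[i] = remap.get(codes[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Character> subsetA = new LinkedHashMap<>();
        subsetA.put(1, 'a'); subsetA.put(2, 'b'); subsetA.put(3, 'c');
        Map<Integer, Character> subsetB = new LinkedHashMap<>();
        subsetB.put(1, 'd'); subsetB.put(2, 'e'); subsetB.put(3, 'f');
        int[] contentB = {1, 2, 3, 1};

        Map<Integer, Character> merged = new LinkedHashMap<>(subsetA);
        Map<Integer, Integer> remap = mergeInto(merged, subsetB);

        // Without rewriting, the old codes select the wrong glyphs in the merged font.
        System.out.println(decode(merged, contentB));
        // After rewriting, the text matches what subset B originally produced.
        System.out.println(decode(merged, rewrite(contentB, remap)));
    }
}
```

Even in this toy setting, merging the font tables is the easy part; every content stream that used the second subset has to be re-encoded, which is the expensive step the answer refers to.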

Bruno Lowagie
  • Thanks a lot for your response, Bruno. Is there a way to make a font, Arial MT for example, get the same random code in all PDFs? Thanks! –  Jan 29 '15 at 13:02