6

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.

The problem I experience on random occasions is that some characters get lost in the merge process and replaced by nothing (or space) in the merged PDF. If I look at the original PDF it looks fine but after merge some characters are missing.

Note that one missing character, such as number 9 or the letter a, can be lost in one place in the document but show up fine somewhere else in the document so it is not a problem displaying it or a font issue as such.

The command I am using is:

gs \
   -q \
   -dNOPAUSE \
   -sDEVICE=pdfwrite \
   -sOutputFile=/tmp/outputfilename \
   -dBATCH \
    /var/www/documents/docs/input1.pdf \
    /var/www/documents/docs/input2.pdf \
    /var/www/documents/docs/input3.pdf 

Has anyone else experienced this or, even better, does anyone know a solution for it?

Kurt Pfeifle
Mr R

3 Answers

10

I've seen this happen when the names of embedded font subsets are identical, but the real content of these subsets is different (containing different glyph sets).

Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:

 # Quote the variable so file names containing spaces are handled correctly.
 for i in input*.pdf; do
     pdffonts "${i}" | tee "${i}.pdffonts.txt"
 done

Look for the font names used in each PDF.

My bet is that you will see identical font names (names similar to BAAAAA+ArialMT) used by different input files.

The BAAAAA+ font name prefix used for subset fonts is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, CAAAAA+, DAAAAA+ etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...

It can easily happen that your input files do not use the exact same subset of characters. However, the identical names could make Ghostscript think that the font really is the same. It (falsely) 'optimizes' the merged PDF and embeds only one of the two font instances (both having the same name, for example BAAAAA+Arial). However, this instance may not include some glyphs which were part of the other instance(s).

This leads to some characters missing in merged output.
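To make the check above less manual, here is a minimal sketch that scans the `*.pdffonts.txt` listings written by the loop and prints every subset font name that appears in more than one file. The function name `shared_subset_names` is hypothetical, and the sketch assumes the usual pdffonts layout (two header lines, font name in the first column, no spaces inside the name). A shared name is only a hint: it does not prove that the glyph sets differ.

```shell
# Hypothetical helper: list subset font names (six uppercase letters
# followed by "+") that occur in more than one pdffonts listing.
shared_subset_names() {
    LC_ALL=C awk '
        # Skip the two header lines of each listing, keep subset fonts only.
        FNR > 2 && $1 ~ /^[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][+]/ {
            # Count each name at most once per file.
            if (!seen[FILENAME, $1]++) count[$1]++
        }
        END {
            for (name in count)
                if (count[name] > 1) print name
        }
    ' "$@" | sort
}
```

Usage, after running the pdffonts loop: `shared_subset_names input*.pdf.pdffonts.txt`. Any name it prints is a candidate for the collision described above.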

I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. You may have better luck with Ghostscript v9.06 (the most recent release to date).

I'm very much interested in investigating this in greater detail. If you can provide a sample of your input files (as well as the merged output given by GS v8.71), I can test whether it works better with v9.06.

What you could do to avoid this problem

  1. Try to always embed fonts as full sets, not subsets:

    • I don't know whether, or how, you can force full font embedding when using wkhtmltopdf.
    • If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
    • If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
    • If Ghostscript generates your input PDFs, the command line parameters to enforce full font embedding are:
      gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file

    Some types of fonts cannot be embedded fully, but only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.

  2. Do the following only if you are sure that no one else gets to see, print, or use your individual input files: do not embed the fonts at all, and embed them only when Ghostscript merges your inputs into the final PDF.

    • I don't know whether, or how, you can disable font embedding when using wkhtmltopdf.
    • If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
    • If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
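    • If Ghostscript generates your input PDFs, the command line parameters to prevent font embedding are (note the -f, which ends the PostScript code started by -c so that the input file name is not parsed as PostScript):
      gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>>setpagedevice" -f input.file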

    Some types of fonts cannot be embedded fully, but only subsetted (Type3, CIDFontType1). See this answer to question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.

  3. Do not use Ghostscript, but rather use pdftk for merging PDFs. pdftk is a more 'dumb' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...


Update

To answer once more, but this time more explicitly (following the extra question of @sacohe in the comments below): in many (not all) cases the following procedure will work:

  • Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).

  • The command to use is this (or similar):
    gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf

The resulting output PDF should then be using different (unique) prefixes to the font names, even when the input PDF used the same name prefix for different font (subsets).

This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).
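The whole procedure (re-distill each input on its own, then merge the results) could be scripted roughly as follows. This is only a sketch under assumptions: the function name `redistill_and_merge` and the `GS` override variable are hypothetical, a Ghostscript 9.x binary is expected on the PATH, and at most 9999 inputs are handled because the temporary files rely on zero-padded names to keep their order.

```shell
# Assumption: "gs" is a Ghostscript 9.x binary; override with GS=/path/to/gs.
GS=${GS:-gs}

# Hypothetical helper.
# Usage: redistill_and_merge merged.pdf input1.pdf input2.pdf ...
# The body runs in a subshell so its variables do not leak to the caller.
redistill_and_merge() (
    [ "$#" -ge 2 ] || { echo "need an output name and at least one input" >&2; return 2; }
    out=$1
    shift
    workdir=$(mktemp -d) || return 1
    n=0
    for f in "$@"; do
        n=$((n + 1))
        # Zero-padded names keep the original input order when globbed below.
        fixed="$workdir/$(printf '%04d' "$n").pdf"
        # Step 1: direct PDF->PDF conversion of a single input file.
        "$GS" -q -o "$fixed" -sDEVICE=pdfwrite "$f" || { rm -rf "$workdir"; return 1; }
    done
    # Step 2: merge the re-distilled files.
    "$GS" -q -o "$out" -sDEVICE=pdfwrite "$workdir"/*.pdf
    status=$?
    rm -rf "$workdir"
    return "$status"
)
```

For the files from the question this would be called as `redistill_and_merge /tmp/outputfilename /var/www/documents/docs/input1.pdf /var/www/documents/docs/input2.pdf /var/www/documents/docs/input3.pdf`. Whether it removes the missing characters still depends on the Ghostscript version doing the re-distilling.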

Community
Kurt Pfeifle
  • Wow Kurt, this was really helpful. I will investigate this further and also try to get some samples to send you to dig into. I'll be back with an update. – Mr R Oct 10 '12 at 07:09
  • Kurt. I have put together the files for you if you are still interested. Can I send them to you privately since I don't want to share them to the public. – Mr R Oct 10 '12 at 12:31
  • OK, I couldn't get the pdffonts utility to work well on my Mac, but I wrote a script that ran 'strings inputfile.pdf |grep FontName', and it showed that every single input file to GS (39 of them) embedded: /FontName /QRAAAA+NimbusSanL-Regu /FontName /QWAAAA+NimbusSanL-Bold So no unique font prefix there. The output file from GS seems to have embedded them all 39 times, but I am not sure about that :-) I am happy to share the files with you to verify if you like. – Mr R Oct 10 '12 at 13:23
  • @MrR: You didn't mention you're on a Mac. On Mac, install [*MacPorts*](http://www.macports.org/install.php) and then run `sudo port -p install ghostscript`. – Kurt Pfeifle Oct 10 '12 at 15:04
  • @MrR: If every single input file of the 39 uses only these two font names (`QRAAAA+NimbusSanL-Regu` and `QWAAAA+NimbusSanL-Bold`) then you have not just one, but two different fonts using non-unique name prefixes! – Kurt Pfeifle Oct 10 '12 at 15:07
  • @MrR: ...also, from the *Nimbus* font names I can already see that these 39 files must have had Ghostscript used as their creator/producer tool. You can run either `pdfinfo` or `strings input.pdf | grep -E '(Creator|Producer)'` to check if I'm correct. – Kurt Pfeifle Oct 10 '12 at 15:09
  • @MrR: you can use `myfirstname.myfamilyname@gmail.com` to send me files if you want... – Kurt Pfeifle Oct 10 '12 at 15:10
  • Thank you for the great explanation. I'm having the same problem but I am also using wkhtmltopdf to produce my input files - did anyone find a solution for that case yet? – sacohe Dec 04 '12 at 21:55
  • I found a solution to my problem so I wanted to post it in case it helps others. In my case, the two PDFs with the same fonts were pages 2 and 3 in the final PDF. Regardless of which order I put them in, the second was always missing characters. I found that if I merged 1 and 2 using GhostScript, and then merged the result of that with 3, then the fonts were embedded by GhostScript separately and didn't cause the original problem. This may not be a practical solution for other people, but it worked for me. – sacohe Dec 06 '12 at 03:12
  • @sacohe: I added an update to my answer which states the fix more explicitly. The real fix is to re-process each input file individually through Ghostscript in a direct *PDF->PDF* conversion. – Kurt Pfeifle Dec 25 '12 at 11:39
2

I wanted to give some feedback: unfortunately, the re-processing trick doesn't seem to work with Ghostscript 8.70 (in RedHat/CentOS releases) and files exported as PDF from Word 2010 (which seems to use the ABCDEE+ prefix for everything). And I haven't been able to find any pre-built version of Ghostscript 9 for my platform.

You mention that older versions of pdftk might work. We moved away from pdftk (newer versions) to gs because some PDF files would cause pdftk to dump core. @Kurt, do you think that trying to find an older version of pdftk might help? If so, what version do you recommend?

Another ugly method that halfway works is to use:

-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false

This converts the fonts to bitmaps, but the characters on the page then look a bit light (not a big deal), text selection is off by about one line height (mildly annoying), and worst of all, even though the characters display OK, copy/paste gives random garbage.

(I was hoping to post this as a comment, but I guess I can't do that. Is the answer closed?)

Alexander
q7joey
  • So your platform is RedHat/CentOS? Which version? Ghostscript 8.70 is too old (well before 2010), and the re-processing trick won't work with it. -- Try this [statically linked version of GS 9.06 (32bit)](http://downloads.ghostscript.com/public/binaries/ghostscript-9.06-linux-x86.tgz) which should suffice to test if the trick works for your files... – Kurt Pfeifle Jan 25 '13 at 08:54
0

From what I can tell, this issue is fixed in Ghostscript version 9.21. We were having a similar issue where merged PDFs were missing characters, and while @Kurt Pfeifle's suggestion of re-distilling those PDFs did work, it seemed a little infeasible/silly to us. Some of our merged PDFs consisted of 600 or more individual PDFs, and re-distilling every single one of them just to merge them seemed nuts.

Our production version of Ghostscript was 9.10 which was causing this problem. But when I did some tests on 9.21 the problem seemed to vanish. I have been unable to produce a document with missing or mangled characters using GS 9.21 so I think that's the real solution here.
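If you rely on the newer version, a merge script can refuse to run against an older Ghostscript. The following is a sketch only: the helper name `version_ge` is hypothetical, and the 9.21 threshold is simply the version that worked in my tests, not a documented fix point.

```shell
# Hypothetical helper: succeeds when dotted version $1 is >= dotted version $2.
# Fields are compared numerically, left to right; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0   # all fields equal
    }'
}
```

Example: `version_ge "$(gs --version)" 9.21 || echo "Ghostscript is older than 9.21, merged PDFs may lose characters" >&2`.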

nzifnab