
I am creating PDF documents from user inputs that are UTF-8.

The problem is not displaying the PDFs: the creation itself fails with java.lang.IllegalArgumentException: U+039B is not available in this font's encoding: WinAnsiEncoding.

Most answers here suggest "using a font with better UTF-8 support", but since I have no control over user inputs, no font's coverage will ever be good enough. I need a bulletproof solution (as in: print something rather than throw an error).

The answer to "Using PDFBox to write unicode strings to a PDF" suggests that the text should be sanitised before it is added to the PDF.

The issue is that I cannot find a valid example of how to achieve this. All the examples seem to point at code that has since been removed (font.setToUnicode, or some method in the encoding to convert characters one at a time).

So in a nutshell: I have a string, and I want a bulletproof method to write most of it to a PDFBox document (obviously, characters missing from the font will be replaced or not printed).

Many thanks, JM

jmc34
  • Which PDFBox version do you use? As the answer you refer to points out, the situation differs for versions 1.8.x and 2.0.x – mkl Feb 24 '17 at 17:25
  • I am using 2.0.3 (the last one published). – jmc34 Feb 24 '17 at 18:58
  • Which font do you use? How do you use it? PDFBox 2.0.x allows you to embed font subsets which contain the glyphs you need. – mkl Feb 25 '17 at 09:29
  • @mkl Yes, I tried the Ubuntu fonts, which improved things up to a point, but it is never going to be good enough as I cannot know in advance which characters will be printed. I am printing text that comes from user input, and users basically have access to the whole UTF-8 set. Is there a way to know which glyphs a font has for which code points? That would be massively inefficient, but I could scan all strings and replace missing characters with a placeholder... – jmc34 Feb 25 '17 at 13:02
  • https://stackoverflow.com/a/31424164/3977077 This helps a lot to remove non-printable characters – Sherlock Aug 24 '17 at 02:04

2 Answers


I ended up doing a character-by-character sanitization.

Here is what my sanitization function looks like.

To avoid reprocessing characters, I am caching the availability of each character for each given font.

When a code point is not available in a font, I try the "standard" replacement character (U+FFFD), and if that is not available either, I fall back to a question mark.

It is indeed inefficient, but I have not found a more efficient way to do this, bearing in mind that I have no control over, and no advance knowledge of, what is being printed.

There might be a lot of things to improve but this works for my use case.

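// Requires java.util.HashMap, java.util.Map and org.apache.pdfbox.pdmodel.font.PDFont.
// Assumed declaration (the original snippet did not show it): caches, per font
// and code point, whether the font can encode that code point.
private final Map<Integer, Boolean> codePointAvailCache = new HashMap<>();

private String getPrintableString(String string, PDFont font) {

    StringBuilder sb = new StringBuilder();
    String fontName = font.getName();

    // Advance by whole code points so that surrogate pairs are not split.
    for (int i = 0; i < string.length(); ) {

        int codePoint = string.codePointAt(i);
        int charCount = Character.charCount(codePoint);

        // Keep line feeds as they are.
        if (codePoint == 0x000A) {
            sb.appendCodePoint(codePoint);
            i += charCount;
            continue;
        }

        // Cache key combining the font name and the code point.
        int cpKey = 31 * fontName.hashCode() + codePoint;

        if (codePointAvailCache.get(cpKey) == null) {

            try {
                font.encode(string.substring(i, i + charCount));
                codePointAvailCache.put(cpKey, true);
            } catch (Exception e) {
                codePointAvailCache.put(cpKey, false);
            }
        }

        if (!codePointAvailCache.get(cpKey)) {

            // Need to make sure our font has a replacement character
            try {
                codePoint = 0xFFFD;
                font.encode(new String(new int[] { codePoint }, 0, 1));
            } catch (Exception e) {
                // Last resort: a plain question mark.
                codePoint = 0x003F;
            }
        }

        sb.appendCodePoint(codePoint);
        i += charCount;
    }

    return sb.toString();
}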
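
For anyone who wants to try the idea without a font file, here is a minimal, self-contained sketch of the same approach. It is an illustration under assumptions rather than the function above: the class name CodePointSanitizer is hypothetical, and the IntPredicate passed to it stands in for "font.encode() did not throw". It walks the string by whole code points and keeps one cache per instance, so one instance per font would be needed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of PDFBox. The IntPredicate is an assumption
// that stands in for "the font can encode this code point".
final class CodePointSanitizer {
    private final IntPredicate canEncode;
    private final Map<Integer, Boolean> cache = new HashMap<>();

    CodePointSanitizer(IntPredicate canEncode) {
        this.canEncode = canEncode;
    }

    // Looks up the cache first and only asks the predicate on a miss.
    private boolean isAvailable(int codePoint) {
        return cache.computeIfAbsent(codePoint, canEncode::test);
    }

    String sanitize(String text) {
        // Pick the fallback once: U+FFFD if encodable, otherwise '?'.
        int fallback = isAvailable(0xFFFD) ? 0xFFFD : '?';
        StringBuilder sb = new StringBuilder(text.length());
        // codePoints() yields whole code points, so surrogate pairs stay intact.
        // Line feeds are passed through unchecked, as in the function above.
        text.codePoints().forEach(cp ->
                sb.appendCodePoint(cp == '\n' || isAvailable(cp) ? cp : fallback));
        return sb.toString();
    }
}
```

With PDFBox, the predicate could presumably wrap a try/catch around font.encode(new String(Character.toChars(cp))), in the same way as the function above does.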
jmc34

I am tackling the same problem, and I adapted the java.awt.Font.canDisplayUpTo() method to use PDFont instead. Thanks to @jmc34 and their sanitization method for the font.encode() example and the inspiration!

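private static int canDisplayUpTo(String text, PDFont font) {
    int len = text.length();
    // Step by whole code points so a surrogate pair is encoded as one unit.
    for (int i = 0; i < len; ) {
        int end = i + Character.charCount(text.codePointAt(i));
        try {
            font.encode(text.substring(i, end));
        } catch (IOException | IllegalArgumentException e) {
            // Index of the first character the font cannot encode.
            return i;
        }
        i = end;
    }
    // The font can encode the whole string.
    return -1;
}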

I took this approach because I had already written the following methods using Font.canDisplayUpTo(), and they worked beautifully until I had to use PDFont.

public static String getFontSupportedString(String text) {
    try {
        Font font = Font.createFont(Font.TRUETYPE_FONT, getBoldFontStream());
        return replaceUnsupportedGlyphs(text, font);
    } catch (IOException | FontFormatException e) {
        // Font.createFont also declares the checked java.awt.FontFormatException.
        throw new RuntimeException(e);
    }
}

private static String replaceUnsupportedGlyphs(String text, Font font) {
    int failIndex = font.canDisplayUpTo(text);
    if (failIndex == -1) {
        return text;
    } else if (failIndex < text.length()) {
        return text.substring(0, failIndex)
                + REPLACEMENT_CHARACTER
                + replaceUnsupportedGlyphs(text.substring(failIndex + 1), font);
    } else {
        return text + REPLACEMENT_CHARACTER;
    }
}

Apart from switching the font type to PDFont, the only change was to replace font.canDisplayUpTo(text) with my canDisplayUpTo(text, font) method, and we were back in business.

Final Code:

public static String getFontSupportedString(String text) {
    try {
        PDFont font = PDType0Font.load(new PDDocument(), getBoldFontStream());
        return replaceUnsupportedGlyphs(text, font);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

private static String replaceUnsupportedGlyphs(String text, PDFont font) {
    int failIndex = canDisplayUpTo(text, font);
    if (failIndex == -1) {
        return text;
    } else if (failIndex < text.length()) {
        return text.substring(0, failIndex)
                + REPLACEMENT_CHARACTER
                + replaceUnsupportedGlyphs(text.substring(failIndex + 1), font);
    } else {
        return text + REPLACEMENT_CHARACTER;
    }
}

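private static int canDisplayUpTo(String text, PDFont font) {
    int len = text.length();
    // Step by whole code points so a surrogate pair is encoded as one unit.
    for (int i = 0; i < len; ) {
        int end = i + Character.charCount(text.codePointAt(i));
        try {
            font.encode(text.substring(i, end));
        } catch (IOException | IllegalArgumentException e) {
            // Index of the first character the font cannot encode.
            return i;
        }
        i = end;
    }
    // The font can encode the whole string.
    return -1;
}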

I have not done efficiency testing, and so make no claims in that regard. Cheers!
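
One thing that may matter for long inputs: replaceUnsupportedGlyphs recurses once per unsupported character. A loop-based variant could look like the sketch below. It is only an illustration under assumptions: GlyphReplacer is a hypothetical name, and the ToIntFunction parameter plays the role of canDisplayUpTo(text, font) (index of the first unsupported character, or -1), so the example runs without PDFBox.

```java
import java.util.function.ToIntFunction;

// Hypothetical helper. firstUnsupported is assumed to behave like
// canDisplayUpTo(text, font): it returns an index below the string length
// for the first unsupported character, or -1 if everything is supported.
final class GlyphReplacer {
    static String replaceUnsupported(String text,
                                     ToIntFunction<String> firstUnsupported,
                                     String replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        String rest = text;
        int failIndex;
        // Loop instead of recursing, so the stack depth stays constant.
        while ((failIndex = firstUnsupported.applyAsInt(rest)) != -1) {
            sb.append(rest, 0, failIndex).append(replacement);
            // Skip the whole code point so a surrogate pair is replaced once.
            int skip = Character.charCount(rest.codePointAt(failIndex));
            rest = rest.substring(failIndex + skip);
        }
        return sb.append(rest).toString();
    }
}
```

With the code above, the call would presumably be GlyphReplacer.replaceUnsupported(text, s -> canDisplayUpTo(s, font), REPLACEMENT_CHARACTER). I have not measured this variant either.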

Sztiv