PDF Tamil writing using PDFBox

Question

PDF viewers do not render all Tamil letters as expected in a PDF I generate with PDFBox.

Below is the actual content rendering in PDF viewer

Below is the expected content

As I understand it, there are three cases in which Tamil letters need glyph substitution or reordering.

Reverse the glyphs:

        கெ = க + ெ =  க ெ  ->  ெ + க = கெ

Split and reorder the glyphs:

        கொ = க + ொ  = க ொ  ->    க + ெ + ா  ->  ெ + க + ா = கொ

Substitute a new glyph for a sequence of glyphs. The new glyph has no Unicode code point; it exists only in the font file:

        கு = க + ு = க ு -> கு
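To make the first two cases concrete, here is a minimal Java sketch using only the JDK. It assumes a simple "consonant + vowel sign" syllable and ignores consonant clusters; the class and method names are made up for illustration. It relies on the fact that the two-part vowel signs (U+0BCA, U+0BCB, U+0BCC) have canonical decompositions, so NFD normalization does the "split" step. The third case is not covered, because that substitution lives in the font's GSUB table.

```java
import java.text.Normalizer;

class TamilReorderSketch {

    // Vowel signs that are drawn to the left of the consonant they follow in memory.
    static boolean isPreBaseVowelSign(int cp) {
        return cp == 0x0BC6 || cp == 0x0BC7 || cp == 0x0BC8;
    }

    // Tamil consonants occupy U+0B95..U+0BB9 (with some unassigned gaps).
    static boolean isConsonant(int cp) {
        return cp >= 0x0B95 && cp <= 0x0BB9;
    }

    // Converts logical (memory) order to the left-to-right order of the glyphs.
    static String toVisualOrder(String logical) {
        // NFD splits two-part vowel signs, e.g. U+0BCA -> U+0BC6 U+0BBE.
        String decomposed = Normalizer.normalize(logical, Normalizer.Form.NFD);
        int[] cps = decomposed.codePoints().toArray();
        for (int i = 1; i < cps.length; i++) {
            if (isPreBaseVowelSign(cps[i]) && isConsonant(cps[i - 1])) {
                // Move the left-side vowel sign in front of its consonant.
                int tmp = cps[i - 1];
                cps[i - 1] = cps[i];
                cps[i] = tmp;
            }
        }
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        // Case 1: U+0B95 U+0BC6 becomes U+0BC6 U+0B95
        toVisualOrder("\u0B95\u0BC6").codePoints()
                .forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
        // Case 2: U+0B95 U+0BCA becomes U+0BC6 U+0B95 U+0BBE
        toVisualOrder("\u0B95\u0BCA").codePoints()
                .forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

This only shows the reordering logic. Writing text in visual order also changes what a viewer extracts or searches, so it is a workaround rather than real shaping.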

Input text	Char list from JDK	Code points from JDK	GID in TTF	Actual*	Expected	Remark
கெ	க + ெ	2965 (U+0B95), 3014 (U+0BC6)	1828, 1856	க + ெ = க ெ	ெ + க = கெ	Reversing the glyphs expected.
கொ	க + ொ	2965 (U+0B95), 3018 (U+0BCA)	1828, 1859	க + ொ = க ொ	க + ெ + ா -> ெ + க + ா = கொ	Split and reorder expected.
கு	க + ு	2965 (U+0B95), 3009 (U+0BC1)	1828, 1854	க + ு = க ு	கு (gid = 6698)	New glyph expected. The new glyph has no Unicode code point; it exists only in the font file.
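For reference, the code point values in the table can be reproduced with the JDK alone. A minimal sketch (the class and method names are hypothetical) that lists each code point in decimal and in the usual U+XXXX notation:

```java
import java.util.stream.Collectors;

class CodePointDump {

    // Formats every code point of the text as "decimal (U+XXXX)".
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("%d (U+%04X)", cp, cp))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // U+0B95 TAMIL LETTER KA followed by U+0BC1 TAMIL VOWEL SIGN U
        System.out.println(describe("\u0B95\u0BC1"));
    }
}
```

Using `codePoints()` rather than `toCharArray()` keeps this correct for characters outside the Basic Multilingual Plane, although the Tamil block itself fits in a single `char`.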

How can I handle these substitutions efficiently?

I have looked at GlyphSubstitutionTable, fontbox.cmap.Identity-H and fontbox.unicode.Scripts.txt, but couldn't work it out so far. Any help would be appreciated.

iPDFdev · Answer 1 · 2022-03-09T11:21:30.067

You need to implement a text shaping engine to handle Tamil writing.

Please see the OpenType specification: https://learn.microsoft.com/en-us/typography/opentype/spec/ . The GSUB and GPOS tables are the ones of main interest to you.

This is no easy task, so using an external library such as HarfBuzz may be a better choice.

There is also a PDFBox issue (4189) regarding Bengali writing. It may help you implement support for Tamil.

Update: for example, this HarfBuzz command line:

hb-shape -O json -u U+0B95,U+0BC1 --no-glyph-names FreeSerif.otf

will return:

[{"g":6698,"cl":0,"dx":0,"dy":0,"ax":858,"ay":0}]

You have to parse the JSON output, get the glyph IDs, and provide them to PDFBox.
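A rough Java sketch of the parsing step follows. It assumes `hb-shape` is on the PATH and is invoked with the same flags as above; the class and method names are invented for illustration. A regular expression stands in for a real JSON parser to keep the example dependency-free, so treat it as a starting point only. How the glyph IDs are then handed to PDFBox depends on the PDFBox version and is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class HbShapeSketch {

    // Matches the "g" (glyph id) field of each glyph object in the JSON output.
    private static final Pattern GLYPH_ID = Pattern.compile("\"g\"\\s*:\\s*(\\d+)");

    // Extracts the glyph ids, in output order, from hb-shape's JSON.
    static List<Integer> parseGlyphIds(String json) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = GLYPH_ID.matcher(json);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    // Runs hb-shape for the given text and returns its raw standard output.
    static String shape(String fontPath, String text) throws IOException, InterruptedException {
        // Build the -u argument, e.g. "U+0B95,U+0BC1".
        String codePoints = text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(","));
        Process process = new ProcessBuilder(
                "hb-shape", "-O", "json", "-u", codePoints, "--no-glyph-names", fontPath)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String out = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("hb-shape exited with an error");
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // With a font path argument, call hb-shape; otherwise use the sample output above.
        String json = args.length > 0
                ? shape(args[0], "\u0B95\u0BC1")
                : "[{\"g\":6698,\"cl\":0,\"dx\":0,\"dy\":0,\"ax\":858,\"ay\":0}]";
        System.out.println(parseGlyphIds(json)); // sample output gives [6698]
    }
}
```

Starting a process per string is slow, so for anything beyond a quick experiment it would make sense to batch the text or to use a HarfBuzz binding instead.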
