PDF viewers are not rendering all Tamil letters as expected.

Below is the actual content as rendered in a PDF viewer: [image]

Below is the expected content: [image]

As far as I understand, three cases require glyph substitution or reordering for Tamil letters.

Reverse the glyphs:

        கெ = க + ெ =  க ெ  ->  ெ + க = கெ 

Split and reorder the glyphs:

        கொ = க + ொ  = க ொ  ->    க + ெ + ா  ->  ெ + க + ா = கொ
                                    

Substitute a new glyph for a sequence of glyphs. The new glyph has no Unicode code point; it exists only in the font file:

        கு = க + ு = க ு -> கு            

| Input text | Char list from JDK | Code points from JDK | gid in ttf | Actual* | Expected |
|---|---|---|---|---|---|
| கெ | க + ெ | 2965 (U+0B95), 3014 (U+0BC6) | 1828, 1856 | க + ெ = க ெ | ெ + க = கெ. Reversing the glyphs expected. |
| கொ | க + ொ | 2965 (U+0B95), 3018 (U+0BCA) | 1828, 1859 | க + ொ = க ொ | க + ெ + ா -> ெ + க + ா = கொ. Split and reorder expected. |
| கு | க + ு | 2965 (U+0B95), 3009 (U+0BC1) | 1828, 1854 | க + ு = க ு | கு (gid = 6698). New glyph expected. The new glyph has no Unicode code point; it exists only in the font file. |
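
The first two cases (reverse, and split and reorder) can be illustrated at the code point level with a minimal Java sketch. This is an assumption-laden illustration, not a shaping engine: the class and method names are hypothetical, it only looks at the single code point before a vowel sign, it ignores conjuncts, and it cannot handle the third case, because the கு ligature has no code point and has to come from the font's GSUB table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: naive logical-to-visual reordering of Tamil vowel signs.
final class TamilVisualOrder {

    // Returns {left, right} parts for vowel signs that need reordering,
    // or null when the code point stays where it is.
    private static int[] parts(int cp) {
        switch (cp) {
            case 0x0BC6: // e
            case 0x0BC7: // ee
            case 0x0BC8: // ai
                return new int[] {cp, 0};          // drawn left of the consonant
            case 0x0BCA: return new int[] {0x0BC6, 0x0BBE}; // o  = e  + aa
            case 0x0BCB: return new int[] {0x0BC7, 0x0BBE}; // oo = ee + aa
            case 0x0BCC: return new int[] {0x0BC6, 0x0BD7}; // au = e  + au length mark
            default:     return null;
        }
    }

    static List<Integer> toVisualOrder(String text) {
        List<Integer> out = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            int[] p = parts(cp);
            if (p == null || out.isEmpty()) {
                out.add(cp);                       // nothing to reorder
                return;
            }
            out.add(out.size() - 1, p[0]);         // left part goes before the previous code point
            if (p[1] != 0) {
                out.add(p[1]);                     // right part stays after it
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // U+0B95 is KA; the second code point is the vowel sign from the question.
        String[] samples = {"\u0B95\u0BC6", "\u0B95\u0BCA", "\u0B95\u0BC1"};
        for (String s : samples) {
            StringBuilder sb = new StringBuilder();
            for (int cp : toVisualOrder(s)) {
                sb.append(String.format("U+%04X ", cp));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Each reordered code point would still have to be mapped to a glyph ID through the font's cmap, and கு would remain two glyphs (1828 and 1854) instead of the single glyph 6698.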

How can I handle these substitutions efficiently?

I have looked at GlyphSubstitutionTable, fontbox.cmap.Identity-H and fontbox.unicode.Scripts.txt, but have not been able to work it out so far. Any help would be appreciated.

Links: Font, Actual, Expected, Use cases, PDFBox Jira

Jeyan

1 Answer

You need to implement a text shaping engine to handle Tamil writing.

Please see the OpenType specification: https://learn.microsoft.com/en-us/typography/opentype/spec/ . The GSUB and GPOS tables are the parts most relevant to you.

This is no easy task, so using an external library such as HarfBuzz may be a better choice.

There is also a PDFBox issue (4189) regarding Bengali writing. It may help you implement support for Tamil.

Update: for example this HarfBuzz command line:

    hb-shape -O json -u U+0B95,U+0BC1 --no-glyph-names FreeSerif.otf

will return:

    [{"g":6698,"cl":0,"dx":0,"dy":0,"ax":858,"ay":0}]

You have to parse the JSON output, extract the glyph IDs (the "g" values) and provide them to PDFBox.
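
The parsing step could look roughly like the Java sketch below. It assumes that `hb-shape` is installed and on the PATH and that it is invoked with `--no-glyph-names`, so every "g" value is numeric. The class and method names are hypothetical, and the regular expression is a shortcut that a real JSON parser should replace in production code. Passing the glyph IDs on to PDFBox is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper around the hb-shape command line shown above.
final class HbShapeOutput {

    // Matches "g":<digits>, which is how glyph IDs appear with --no-glyph-names.
    private static final Pattern GLYPH_ID = Pattern.compile("\"g\"\\s*:\\s*(\\d+)");

    // Extracts the glyph IDs, in output order, from hb-shape's JSON.
    static List<Integer> glyphIds(String json) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = GLYPH_ID.matcher(json);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    // Builds the value for the -u option, e.g. "U+0B95,U+0BC1".
    static String unicodesArg(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(","));
    }

    // Runs hb-shape with the same options as the command above and returns the glyph IDs.
    static List<Integer> shape(String fontPath, String text)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "hb-shape", "-O", "json", "-u", unicodesArg(text),
                "--no-glyph-names", fontPath)
                .redirectErrorStream(true)
                .start();
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            throw new IOException("hb-shape failed: " + out);
        }
        return glyphIds(out);
    }

    public static void main(String[] args) {
        // Sample output copied from the command above; no external process is needed here.
        String sample = "[{\"g\":6698,\"cl\":0,\"dx\":0,\"dy\":0,\"ax\":858,\"ay\":0}]";
        System.out.println(glyphIds(sample)); // prints [6698]
    }
}
```

The "ax" and "dx"/"dy" fields carry advances and offsets; they are ignored here but matter once glyphs need positioning.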

iPDFdev