This is a "meta-question" which I came across when trying to find a better specification for another of my questions (Rendering Devanagari ligatures (Unicode) in Java Swing JComponent on Mac OS X).

What I don't quite understand as of yet is which "component" (for want of a better word) of a given system is responsible for displaying Unicode text in Java, and more specifically ligatures.

As far as I understand, the following components have an influence on the process:

  1. The system character encoding (which for example is UTF-8 on Mac OS X 10.6, UTF-16 on Windows 7 (according to akira's comment on this superuser.com post)).
  2. The Java Charset (which by default is MacRoman on Mac OS X 10.6, cp1252 on Windows 7).
  3. The font that is used to render the text, and that font's encoding information (as suggested by Donal Fellows on my other question):

    "fonts include information about what encoding they're using".

  4. Obviously whether the characters to render are present at the respective Unicode code points.

So if a string of Unicode characters doesn't display correctly (as seen in my other question, see above), where would the problem most probably be? I.e., what "component" (what would a better word be?) is responsible for "binding" the ligature, that is, for composing it?

Thank you very much in advance and please let me know should you need more information.

s.d
  • I would hazard a guess at the virtual machine, but I have no evidence or expertise in this matter. – Mr47 May 17 '11 at 14:34
  • @Mr47: Okay, thanks, that would be number (2) then. Which is where might come in handy I guess. I will keep that in mind. I have amended the post a bit to specify the "entry point" of my problem for others. – s.d May 17 '11 at 14:54

4 Answers

4

That system component is called a font renderer or font rasterizer. It is responsible for converting a sequence of character codes into pixels based on glyphs defined in a font. As other answers have stated, the various character encoding values you can get and set from Java are irrelevant. When the JVM gives the font renderer a sequence of character codes, it tells it what encoding applies (probably UTF-16, but this is transparent to the Java programmer). The font renderer uses the font encoding specified in the font file to match up the corresponding glyphs.
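To see the renderer's character-to-glyph step from Java, one option is to ask Java2D to lay out a string and compare the number of glyphs with the number of chars. This is only a rough sketch: the class and method names (`GlyphProbe`, `glyphCount`) are made up for illustration, the logical `Dialog` font is an arbitrary choice, and the result depends entirely on the installed fonts and the renderer in use, so no particular output should be expected.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;

// Hypothetical helper, not part of any library.
final class GlyphProbe {
    // The seven code points from the question (लक्ष्मी).
    static final String SAMPLE = "\u0932\u0915\u094D\u0937\u094D\u092E\u0940";

    // Runs Java2D text layout (including any shaping the renderer performs)
    // and reports how many glyphs were produced for the text.
    static int glyphCount(Font font, String text) {
        FontRenderContext frc = new FontRenderContext(null, true, true);
        char[] chars = text.toCharArray();
        GlyphVector gv = font.layoutGlyphVector(
                frc, chars, 0, chars.length, Font.LAYOUT_LEFT_TO_RIGHT);
        return gv.getNumGlyphs();
    }

    public static void main(String[] args) {
        // Assumption: the logical "Dialog" font maps to a Devanagari-capable font.
        Font font = new Font(Font.DIALOG, Font.PLAIN, 24);
        System.out.println("chars:  " + SAMPLE.length());
        System.out.println("glyphs: " + glyphCount(font, SAMPLE));
    }
}
```

If the glyph count is lower than the char count, that hints that the renderer combined some characters into conjunct glyphs; an equal count may mean that no shaping took place, although this is a heuristic rather than proof.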

Current versions of Windows and Mac OS X come with excellent font renderers.

The first point of confusion is that the JRE comes with its own font renderer, as part of the Java2D platform, and this is what Swing uses. There ought to be an option to control whether Java uses its own renderer or the system one.

EDIT: As McDowell pointed out in a comment, on OS X you can enable the system renderer by setting the Java property apple.awt.graphics.UseQuartz=true.
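A minimal sketch of how that property could be set programmatically. The class name `QuartzSwitch` is hypothetical, and I am assuming the property has to be in place before any AWT/Swing class is initialised; passing `-Dapple.awt.graphics.UseQuartz=true` on the command line avoids that ordering question. Runtimes other than Apple's Java will presumably just ignore the property.

```java
// Hypothetical launcher class, shown only to illustrate the ordering.
final class QuartzSwitch {
    static final String USE_QUARTZ = "apple.awt.graphics.UseQuartz";

    // Requests the Quartz renderer; harmless on platforms that ignore the property.
    static void enableQuartz() {
        System.setProperty(USE_QUARTZ, "true");
    }

    public static void main(String[] args) {
        // Assumption: this must run before the first AWT/Swing class is touched.
        enableQuartz();
        System.out.println(USE_QUARTZ + "=" + System.getProperty(USE_QUARTZ));
        // ... create the Swing UI from here on ...
    }
}
```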

The second point of confusion is that ligatures are optional in English. A desktop publishing application will substitute an "ffl" ligature (a single glyph in the font) when it sees a word like "shuffle", but most other applications don't bother. Based on what you've said about Devanagari (and what I just read on Wikipedia) I gather the ligatures are not optional in that language.

By default, the Java2D font renderer does not do ligatures. However, the JavaDoc for java.awt.font.TextAttribute.LIGATURES says that ligatures are always enabled for writing systems that require them. If that isn't your experience, you may have found a bug in the Java2D font renderer. Meanwhile, try using the Font constructor that takes a map of font attributes, including TextAttribute.LIGATURES.
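As a rough sketch of that last suggestion, the attribute map can be built and handed to the `Font(Map)` constructor like this. The class name `LigatureFont`, the font family and the size are placeholders of mine, and whether the attribute has any visible effect depends on the font and the renderer.

```java
import java.awt.Font;
import java.awt.font.TextAttribute;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class LigatureFont {
    // Builds an attribute map that requests optional ligatures.
    static Map<TextAttribute, Object> ligatureAttributes(String family, float size) {
        Map<TextAttribute, Object> attrs = new HashMap<TextAttribute, Object>();
        attrs.put(TextAttribute.FAMILY, family);
        attrs.put(TextAttribute.SIZE, size);
        attrs.put(TextAttribute.LIGATURES, TextAttribute.LIGATURES_ON);
        return attrs;
    }

    public static void main(String[] args) {
        // Assumption: "Serif" stands in for whatever font the application uses.
        Font font = new Font(ligatureAttributes(Font.SERIF, 24f));
        System.out.println(font.getAttributes().get(TextAttribute.LIGATURES));
        // In a paint method one would then call g2.setFont(font).
    }
}
```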

gatkin
  • Thanks very much for this interesting insight. You're right that ligatures are not optional at all in Devanagari. However, I've tested the `LIGATURES_ON` `TextAttribute` (as suggested by [Oracle](http://download.oracle.com/javase/tutorial/2d/text/textattributes.html), see below) and it didn't change a thing, unfortunately. Which leaves the fontconfig file as the most likely source of the issue. `Map m = new Hashtable(); m.put(TextAttribute.LIGATURES, TextAttribute.LIGATURES_ON); font = font.deriveFont( m ); g2.setFont( font );` – s.d May 18 '11 at 13:41
  • @baphomet13 - it appears that you can use a [Java system property](http://developer.apple.com/library/mac/#documentation/Java/Reference/Java_PropertiesRef/Articles/JavaSystemProperties.html#//apple_ref/doc/uid/TP40008047) on OS X to switch between Java2D and Quartz rendering: `apple.awt.graphics.UseQuartz` – McDowell May 18 '11 at 22:30
  • @McDowell: Your last comment was spot on and solved my problem. May I suggest you edit it into your answer, so I can accept it as the best answer? Also, I've set a bounty on my related question [http://stackoverflow.com/questions/5994815/rendering-devanagari-ligatures-unicode-in-java-swing-jcomponent-on-mac-os-x], and I suggest you add your answer there as well so I can award the bounty to you! Many thanks again, you helped me a lot there! – s.d Jun 16 '11 at 15:25
  • @McDowell - despite baphomet13's suggestion, I just edited my answer to include a reference to your comment. – gatkin Jun 16 '11 at 20:31
3

I'm no expert, but hopefully these tips will point you in the right direction...

The encoding of source data has little bearing on how fonts are rendered. All character data in Java is UTF-16, so as long as you transcode information correctly from source to chars/strings, integrity of the data should be preserved.

However, note:

  • The AWT system can use the default system encoding to do font mapping
  • This is unlikely to apply for Devanagari (I am not aware of a legacy encoding that supports it)

AWT maps fonts via the fontconfig file. On my Windows system, this maps to the Mangal font:

allfonts.devanagari=Mangal

No doubt a different font is being used on Mac OS.

Native text rendering was introduced sometime during the Java 6 lifetime - I don't know if that has any bearing on font support or just affects rendering speed/antialiasing/etc.

McDowell
  • Thank you for your tips! This very much sounds like what I was looking for, despite my difficulty in describing the problem. I will need a little while to test it but will be sure to follow up on it here. – s.d May 17 '11 at 15:45
  • I have just queried a few Mac users and they all have fontconfig.properties mapping to Mangal for `allfonts.devanagari`. To be honest, I am now at a complete loss how to get my head round why there should be a difference between the Mac and the Windows display, thus I'd be grateful for any further hints. – s.d May 18 '11 at 16:22
  • @baphomet13 - assuming the `Mangal` font is identical on both platforms (and not different implementations under the same name) then I suspect [gatkin](http://stackoverflow.com/questions/6032401/which-system-component-is-responsible-for-binding-unicode-ligatures-in-a-java-app/6033769#6033769) is closer to the mark - the problem may be in _how_ the font is being rendered. – McDowell May 19 '11 at 09:04
2

If you refer strictly to the visual rendering, then "encoding" and related topics are no longer relevant: Rendering goes from String to visual display. The String has a defined (and unchangeable) encoding, which is UTF-16. So all questions like "did I read this binary stream with the correct encoding" have to be solved first.

The actual rendering of the text must be done by the graphics subsystem. That would be AWT/Swing for "normal" Java or SWT or any other alternative system.

The first step (which is not strictly part of "rendering") is to convert some binary data to a String. This can involve platform default encoding iff the code doesn't specify some encoding explicitly. This is the step where encodings in general come into play. After that, we're in happy-happy-pure-Unicode-land.
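To illustrate that first step, here is a small sketch of decoding bytes with an explicitly named charset, and of what happens when the wrong one is applied. The class and method names (`ExplicitDecoding`, `decodeUtf8`) are invented for this example, and the sample text is the Devanagari word from the comments below.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class ExplicitDecoding {
    // Decodes with a named charset instead of relying on the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0932\u0915\u094D\u0937\u094D\u092E\u0940";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the text survives the round trip.
        String decoded = decodeUtf8(utf8);
        System.out.println(decoded.equals(original));

        // Wrong charset: every byte becomes its own char, so the text is mangled
        // before any font renderer ever sees it.
        String mangled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(mangled.equals(original));
    }
}
```

Once the `String` is correct, as it is for a literal written with `\u` escapes, decoding is out of the picture and any remaining display problem lies with rendering.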

Joachim Sauer
  • Thank you very much for the specification of terms. I'm afraid my explanation wasn't very specific. Am I correct, however, in the assumption that the correct *display* of ligatures (e.g., लक्ष्मी, which is built using seven Unicode code points, or German ff) has to do with the *character encoding* (the `System` `Property` `"file.encoding"`)? – s.d May 17 '11 at 15:13
  • Also, I've changed the title and text to reflect your corrections. – s.d May 17 '11 at 15:19
  • @baphomet: no, the correct **display** does not. The question is: does your Unicode data contain U+FB00 LATIN SMALL LIGATURE FF or does it contain two U+0066 LATIN SMALL LETTER F? – Joachim Sauer May 17 '11 at 15:27
  • My `String` contains seven code points (\u0932\u0915\u094D\u0937\u094D\u092E\u0940) which should display the Devanagari ligature लक्ष्मी (/laksmi/). I would expect the Unicode data written like that to be displayed as the ligature, which indeed it does on Windows 7 and Ubuntu machines, but not on Mac OS X. As ligatures in Devanagari are usually words they don't have single code points like LATIN SMALL LIGATURE FF. – s.d May 17 '11 at 15:38
  • It should be noted that U+FB00 is kind of a strange thing: Unicode generally **doesn't** provide separate code points for ligatures (arguing that those are rendering decisions and not text information). That one (and similar ones) only exist for round-trip correctness with some much-used legacy encodings that could encode those ligatures. – Joachim Sauer May 17 '11 at 15:53
1

Similar to what Joachim said, what is the source of the data? If you're reading from a file or stream, I definitely would not trust the system default encoding. You should explicitly set the encoding when reading the data, e.g.

BufferedReader br = new BufferedReader( new InputStreamReader( file, "UTF-8" ) );

Or whatever encoding your stream is in.

See:

http://download.oracle.com/javase/1.4.2/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.lang.String)

Bill Brasky
  • Okay, I know now why I have triggered Joachim's response. In fact I do *not* read from a file but have defined a `String` variable with Unicode chars (e.g., `String str = "\u0932\u0915\u094D\u0937\u094D\u092E\u0940"`). These are *not* displayed correctly on a Mac system, but *are* displayed correctly on a Windows system which provoked my question. I will remove number (3) so that it won't trigger further answers about reading streams. Sorry, I thought I'd put that in for completeness' sake. – s.d May 17 '11 at 15:23