Reading Glyphs from a String using codePointAt(i) or Charset issue

Question

I created a text editor for JavaFX that paints the text on a Canvas, glyph by glyph. I use String.codePointAt(i) to load the glyphs correctly. Somehow the first glyph is a strange one, and I don't know why. The file was loaded using the charset UTF-16LE.

Here is the rendered string, the first glyph is strange:

And here you can see the textLine and ch bytes after the first character:

And here is the code I use to iterate a text line:

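int i = 0;
while (i < textLine.length()) {
   // Read a full code point, which may span two chars (a surrogate pair).
   int codePoint = textLine.codePointAt(i);
   i += Character.charCount(codePoint);
   // Character.toString(int) requires Java 11 or later.
   String ch = Character.toString(codePoint);
   graphicContext.fillText(ch, x, y);
}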

Is this code wrong, or is it an encoding or file issue?

The signed bytes `-1,-2` are the [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark) for _UTF-16 LE_ (`U+FEFF` _Zero Width No-Break Space_). See [Byte order mark screws up file reading in Java](https://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java) – JosefZ, Sep 21 '21 at 15:52

score 3 · Answer 1 · answered Sep 21 '21 at 15:38

The first character is U+FEFF, which is a BOM (byte order mark). You should skip over it and not display it. Your code is mostly all right, but it goes code point by code point, not glyph by glyph. Consider the case where the text contains a character followed by a combining accent, or even multiple combining accents. You would render these separately, when they should be rendered together as a single glyph.

I believe you can get the glyphs together with their combining accents using a BreakIterator. See in particular the discussion of character boundary analysis.
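As a rough sketch of that idea, the loop from the question could be replaced by one that asks BreakIterator.getCharacterInstance for character boundaries and takes each substring between two boundaries as one unit to draw. The class name GraphemeSplitter and the method clusters are made up for this illustration, and Locale.ROOT is an assumption rather than a tested recommendation. How well more complex sequences (emoji, for example) are grouped may depend on the JDK version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original question or answer.
class GraphemeSplitter {

    // Splits a line into user-perceived characters: a base character
    // together with any combining marks that follow it.
    static List<String> clusters(String textLine) {
        // Locale.ROOT is an assumption; another locale can be passed instead.
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(textLine);

        List<String> result = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or DONE at the end of the text.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(textLine.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent), then "x".
        for (String cluster : clusters("e\u0301x")) {
            System.out.println(cluster + " (" + cluster.length() + " char(s))");
        }
    }
}
```

In the editor, each returned string would then be passed to fillText in place of the single code point.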

If you know that combining accents are not an issue for you, though, then your current approach is fine. Just skip any initial BOM.
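Skipping the BOM can be as small as checking the first char of the text before iterating over it. The sketch below assumes the whole file, or at least its first line, is already in a String. BomSkipper and stripLeadingBom are names invented for this example.

```java
// Hypothetical helper, not part of the original question or answer.
class BomSkipper {

    // U+FEFF, which is how the BOM shows up after decoding with UTF-16LE.
    private static final char BOM = '\uFEFF';

    // Returns the text without a leading BOM; other text is returned unchanged.
    static String stripLeadingBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String firstLine = "\uFEFFHello";
        System.out.println(stripLeadingBom(firstLine)); // prints "Hello"
    }
}
```

Only the start of the file needs this check, so it can run once right after loading. Another option that may be worth trying is to decode the file with the charset "UTF-16" instead of "UTF-16LE", since that decoder uses a leading BOM to detect the byte order rather than passing it through as a character.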

David Conrad

The editor should be able to render any text, including Korean, Cyrillic, Japanese, etc. Probably for this case, I would need to switch to the BreakIterator, right? But I saw this one is also bound to a certain locale: BreakIterator.getSentenceInstance(Locale.US); How should I use it? – DbSchema Sep 22 '21 at 05:34
@DbSchema Yes, in that case I would switch to the BreakIterator, but I'm not sure how to handle the locales. Maybe just using Locale.ROOT would work. – David Conrad Sep 22 '21 at 16:58
