
For example, I have the Arabic string "يبسش" (sorry, I picked the characters at random). It contains four Arabic characters, and I want to get the UTF-8 code of each one. But when I read the characters with an enumerator, I get the isolated form of each one instead of the form actually used in the word.

Is there a way to get the UTF-8 codes of these characters in the form in which they are used?

ouflak
Jesús Galindo
  • 1
    This MSDN link probably answers your question: http://msdn.microsoft.com/en-us/library/vstudio/7h9tk6x8(v=vs.100).aspx – dotNET Aug 28 '14 at 11:12
  • 2
    So... what is the *expected* output that you are looking for here? – Marc Gravell Aug 28 '14 at 11:20
  • 2
    Arabic has different forms for the same character based on its position. For example, ههه is the same character, ه, three times. If I understand correctly, Jesús wants to get the Unicode code point of each distinct form? I didn't think they even *had* distinct Unicode points, but [apparently they do](http://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Contextual_forms) (If all of this is obvious you are in luck, but I think this background is missing, or at least the question could be more explicit) – Kobi Aug 28 '14 at 11:25
  • 1
    @Kobi: *I didn't think they even had different Unicode points, but apparently they do*. This has been a source of confusion for me as well. Isn't Unicode about **characters** and not **glyphs**? If yes, why should it assign different code points to different forms of the same character? – dotNET Aug 28 '14 at 11:32
  • 1
    It is the job of the text rendering engine to generate the ligatures. When you look at the individual Unicode codepoints then you only ever see the non-combined glyphs. Necessarily so, the rendering engine doesn't have anything to combine it with. This is of course not a real problem. – Hans Passant Aug 28 '14 at 11:35
  • 1
    @HansPassant - I also tried to get the code points (as integers or bytes), and always get the same number, so it isn't just the rendering or context. It's a real problem if I want to draw the characters on an image, each with a different color. That might be a silly example, but I'd like to know how it's done. – Kobi Aug 28 '14 at 11:41
  • 1
    You of course have to draw the word. That is necessary for more than one reason: measuring the required space for the text also cannot be done accurately by measuring individual glyphs and summing them. – Hans Passant Aug 28 '14 at 11:45
  • 2
    @Kobi: FYI: I just found that contextual forms are only there in Unicode for historical/legacy reasons and backward compatibility. Any modern application will always store text characters using the original forms only, not the contextual forms. Glyph construction is the job of the rendering engine (Uniscribe, in the case of Windows). – dotNET Aug 28 '14 at 13:45
  • 2
    [This SO post](http://stackoverflow.com/questions/1169709/how-do-i-get-the-characters-for-context-shaped-input-in-a-complex-script) suggests that it is quite difficult to achieve what you're after. You'll have to dive into Uniscribe (the Windows text rendering engine) for it. – dotNET Aug 28 '14 at 13:47
  • 1
    The character is the same; it's the glyph that varies. If you knew which positional form you wanted, you should be able to get it by putting ZWJ (U+200D) / ZWNJ (U+200C) before/after each character, but to know which positional form to use, you'll probably need to know the Unicode Joining_Type of the character itself, and of the preceding and following characters, ignoring those with Joining_Type=Transparent. It doesn't appear to be that easy. – ninjalj Sep 12 '14 at 15:06
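
The idea in the comments above (pick a positional form from the joining behaviour of each letter and its neighbours) can be sketched roughly as follows. This is only an illustration in Python, not the Uniscribe route and not the ZWJ/ZWNJ trick: it maps each stored letter to its legacy code point in Arabic Presentation Forms-B via `unicodedata.decomposition`. Several things here are assumptions rather than the real Unicode algorithm: the helper names (`FORMS`, `joins_next`, `joins_prev`, `neighbor`, `shape`) are made up for this sketch, the joining type is approximated by checking which presentation forms a letter has instead of reading the actual Joining_Type property, nonspacing marks are treated as the transparent characters, and ligatures such as lam-alef are ignored.

```python
import unicodedata

TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")

# Map (base letter, form name) -> legacy presentation-form character, read from
# the compatibility decompositions in Arabic Presentation Forms-B (U+FE70..U+FEFF).
FORMS = {}
for cp in range(0xFE70, 0xFF00):
    parts = unicodedata.decomposition(chr(cp)).split()
    # Keep single-letter entries such as "<initial> 0647"; skip ligatures
    # and unassigned code points.
    if len(parts) == 2 and parts[0] in TAGS:
        FORMS[(chr(int(parts[1], 16)), parts[0][1:-1])] = chr(cp)


def joins_next(ch):
    """Assumption: a letter that has an initial form can join the following letter."""
    return (ch, "initial") in FORMS


def joins_prev(ch):
    """Assumption: a letter that has a final form can join the preceding letter."""
    return (ch, "final") in FORMS


def neighbor(text, i, step):
    """Nearest character in direction `step`, skipping nonspacing marks."""
    j = i + step
    while 0 <= j < len(text) and unicodedata.category(text[j]) == "Mn":
        j += step
    return text[j] if 0 <= j < len(text) else ""


def shape(text):
    """Replace each letter by its positional presentation form where one exists."""
    names = {(False, False): "isolated", (False, True): "initial",
             (True, True): "medial", (True, False): "final"}
    out = []
    for i, ch in enumerate(text):
        before = joins_prev(ch) and joins_next(neighbor(text, i, -1))
        after = joins_next(ch) and joins_prev(neighbor(text, i, +1))
        out.append(FORMS.get((ch, names[(before, after)]), ch))
    return "".join(out)


word = "\u0647\u0647\u0647"  # HEH three times, the example from the comments
for stored, shaped in zip(word, shape(word)):
    # Stored code point, shaped code point, UTF-8 bytes of the shaped form.
    print(f"U+{ord(stored):04X} -> U+{ord(shaped):04X}", shaped.encode("utf-8").hex())
```

For the three identical stored code points U+0647, this sketch yields U+FEEB (initial), U+FEEC (medial) and U+FEEA (final). As noted in the comments, these presentation forms exist for compatibility; for anything beyond inspection, letting the rendering engine shape the whole word is the safer choice.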

0 Answers