I am trying to render Devanagari characters on a screen, but the development environment I'm programming in has no Unicode support. To draw characters, I use binary matrices that color the corresponding pixels on the screen, and I sorted these matrices in Unicode order. For languages that use the Latin alphabet I had no issues: I only needed to draw the characters one after another to represent a string. For Devanagari characters it's different.

In the Devanagari script, some characters, when placed next to others, can completely change the appearance of the word, both in the order and in the shape of the characters. The result is perceived as a single character, but when read as Unicode it is actually two or more distinct characters.

This merging sometimes occurs in a simple way:

क + ् = क्

ग + ् = ग्

फ + ि = फि

But other times you get completely different characters:

क + ् + क = क्क

ग + ् + घ = ग्घ

क + ् + ष = क्ष

I found several papers describing the complex grammatical rules that determine how these characters merge (https://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf), but the more I look into it, the more I realize that I would need to learn Hindi to understand those rules and then create an algorithm.

I would like to understand the principles behind these character combinations without necessarily having to learn Hindi. Has anyone already solved this problem or found an alternative solution that they would be willing to share?

  • "I don't have Unicode support". I'm curious to know which development environment you are using. Note: it depends on the language (and dialect), not just the script. In any case, how do you display characters without Unicode support? Do you need to implement the whole stack? Or do you in reality have full Unicode support but haven't realized it yet? Usually we send Unicode data to the OS, and it does all the complex parts (segmentation, shaping, glyph display, etc.). – Giacomo Catenazzi Dec 02 '22 at 10:53
  • I'm on an embedded C89 environment, so I have to implement all the layers. I don't have an OS with built-in functions, so to write characters I use binary matrices to draw pixel by pixel on the screen. – Arcangelo Pace Dec 02 '22 at 12:03
  • So it is probably better to check older standards (more in line with bitmaps) than Unicode, which expects more typographic quality (and ligatures). See e.g. the ISCII standard https://varamozhi.sourceforge.net/iscii91.pdf (from the ISCII page on Wikipedia) – Giacomo Catenazzi Dec 02 '22 at 12:56
  • ISCII won't solve his problem. What Arcangelo is missing is the layers that handle glyph shaping: mapping a character string to a sequence of positioned glyphs through steps of glyph re-ordering, substitution, and positioning adjustment. ISCII is just another character encoding, and its set of characters corresponds 1:1 to those in Unicode. – Peter Constable Dec 02 '22 at 19:56

1 Answer

Whether Devanagari text is encoded using Unicode or ISCII, displaying it requires a combination of a shaping engine and font data that maps a string of characters into an appropriate sequence of positioned glyphs. The set of glyphs needed for Devanagari will be a fair bit larger than the initial set of characters.
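To make the "more glyphs than characters" point concrete, here is a minimal C89 sketch of the substitution idea for a bitmap-font setup like the one in the question. Everything named here is an assumption for illustration: the glyph IDs in the 0xE000 range, the `ConjunctRule` table, and the `map_glyphs` function are hypothetical, and the three rules are nowhere near a complete Devanagari rule set. Plain characters keep their code point as the glyph ID; a few consonant + virama + consonant sequences map to extra glyphs.

```c
#include <stddef.h>

typedef unsigned short u16;

#define VIRAMA        0x094Du
/* Hypothetical IDs for extra bitmaps that have no code point of their own. */
#define GLYPH_KSSA    0xE000u  /* ligature for <ka, virama, ssa> */
#define GLYPH_JNYA    0xE001u  /* ligature for <ja, virama, nya> */
#define GLYPH_KA_HALF 0xE002u  /* half form of ka before another consonant */

typedef struct {
    u16 first;        /* consonant before the virama */
    u16 second;       /* consonant after the virama, 0 means any consonant */
    u16 glyph;        /* glyph to emit */
    size_t consumed;  /* code points consumed: 3 for a ligature, 2 for a half form */
} ConjunctRule;

/* Specific ligatures are listed before the generic half-form rule. */
static const ConjunctRule rules[] = {
    { 0x0915u, 0x0937u, GLYPH_KSSA,    3 },
    { 0x091Cu, 0x091Eu, GLYPH_JNYA,    3 },
    { 0x0915u, 0,       GLYPH_KA_HALF, 2 }
};

/* Simplification: only the basic consonant range U+0915..U+0939. */
static int is_consonant(u16 cp)
{
    return cp >= 0x0915u && cp <= 0x0939u;
}

/* Maps n code points to glyph IDs; out must have room for n entries.
   Returns the number of glyph IDs written. */
size_t map_glyphs(const u16 *text, size_t n, u16 *out)
{
    const size_t nrules = sizeof rules / sizeof rules[0];
    size_t i = 0, o = 0, r;

    while (i < n) {
        int matched = 0;
        /* A rule can only apply to <consonant, virama, consonant>. */
        if (i + 2 < n && text[i + 1] == VIRAMA && is_consonant(text[i + 2])) {
            for (r = 0; r < nrules && !matched; r++) {
                if (rules[r].first == text[i] &&
                    (rules[r].second == 0 || rules[r].second == text[i + 2])) {
                    out[o++] = rules[r].glyph;
                    i += rules[r].consumed;
                    matched = 1;
                }
            }
        }
        if (!matched)
            out[o++] = text[i++];  /* fall back to the plain character glyph */
    }
    return o;
}
```

With this table, <ka, virama, ssa> comes out as a single ligature glyph, <ka, virama, ka> as a half-form glyph followed by a full ka, and a word-final <ka, virama> stays as ka plus a visible virama. A real font needs many more rules than this, which is exactly the data that OpenType Layout tables carry.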

The shaping steps involve an analysis of clusters, re-ordering of certain elements within clusters, substitution of glyphs, and finally positioning adjustments to the glyphs. Consider this example:

क + ् + क + ि = क्कि

The cluster analysis is needed to match elements against a general cluster pattern: for example, which consonant is the "base" consonant of the cluster, which additional consonants conjoin with it, which elements are vowels, and what type each vowel is with regard to visual positioning. In that example, the <ka, virama, ka> sequence forms a base that vowels or other marks are positioned relative to. The second ka is the "base" consonant, and the initial <ka, virama> sequence conjoins with it as a "half" form. The short-i vowel is one that needs to be repositioned to the left of the conjoined-consonant combination.
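The re-ordering step for that example can be sketched in C89 roughly as follows. This is only an illustration under simplifying assumptions: the function names `cluster_start` and `reorder_prebase_matra` are hypothetical, a cluster is taken to be just consonant (virama consonant)*, and only the short-i vowel sign (U+093F) is handled. A real shaping engine recognizes many more cluster elements (nukta, reph, other marks).

```c
#include <stddef.h>

typedef unsigned short u16;

#define VIRAMA       0x094Du
#define VOWEL_SIGN_I 0x093Fu  /* short-i matra, drawn to the left of its cluster */

/* Simplification: only the basic consonant range U+0915..U+0939. */
static int is_consonant(u16 cp)
{
    return cp >= 0x0915u && cp <= 0x0939u;
}

/* text[end] is the base consonant. Walks back over <consonant, virama>
   pairs and returns the index where the cluster begins. */
static size_t cluster_start(const u16 *text, size_t end)
{
    size_t start = end;
    while (start >= 2 && text[start - 1] == VIRAMA &&
           is_consonant(text[start - 2]))
        start -= 2;
    return start;
}

/* Moves each short-i matra in front of the consonant cluster it follows,
   so that the buffer ends up in left-to-right visual order. */
void reorder_prebase_matra(u16 *text, size_t n)
{
    size_t i, j, start;

    for (i = 1; i < n; i++) {
        if (text[i] == VOWEL_SIGN_I && is_consonant(text[i - 1])) {
            start = cluster_start(text, i - 1);
            /* Shift the cluster one slot right, then put the matra first. */
            for (j = i; j > start; j--)
                text[j] = text[j - 1];
            text[start] = VOWEL_SIGN_I;
        }
    }
}
```

For <ka, virama, ka, short-i> the buffer becomes <short-i, ka, virama, ka>, so a simple left-to-right drawing loop places the vowel sign before the whole conjunct rather than before the second ka only.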

The Devanagari section in the Unicode Standard describes in a general way some of the actions that will be needed in display, but it's not a specific implementation guide.

The OpenType font specification supports display of scripts like Devanagari through a combination of "OpenType Layout" data in the font plus shaping implementations that interact with that data. You can find documentation specifically for Devanagari font implementations here:

https://learn.microsoft.com/en-us/typography/script-development/devanagari

You might also find helpful the specification for the "Universal Shaping Engine" that several implementations use (in combination with OpenType fonts) for shaping many different scripts:

https://learn.microsoft.com/en-us/typography/script-development/use

You don't necessarily need to use OpenType, but you will want some implementation with the functionality I've described. If you're running in a specific embedded OS environment that isn't, say, Windows IoT, you evidently can't take advantage of the OpenType shaping support built into Windows or other major OS platforms. But perhaps you could take advantage of HarfBuzz, which is an open-source OpenType shaping library:

https://github.com/harfbuzz/harfbuzz

This would need to be combined with Devanagari fonts that have appropriate OpenType Layout data, and there are plenty of those, including OSS options (e.g., Noto Sans Devanagari).

Peter Constable
  • I have been searching for several days for a more immediate alternative, but I have come to the conclusion that there is no other way that can guarantee a correct result. I thank Peter Constable for providing this solution; I appreciate it very much. – Arcangelo Pace Dec 13 '22 at 14:01
  • @ArcangeloPace It would be helpful for others if you marked the answer as an accepted answer. – Peter Constable Dec 13 '22 at 16:47