
I need to split a Unicode string into its individual Unicode characters.

For example, in Tamil:

"கமலி" => 'க', 'ம', 'லி'

I'm able to get the Unicode bytes, but producing the characters as a reader perceives them has become a problem.

byte[] stringBytes = Encoding.Unicode.GetBytes("கமலி");
char[] stringChars = Encoding.Unicode.GetChars(stringBytes);
// Each char is a single UTF-16 code unit, so the vowel sign is written on its own.
foreach (var crt in stringChars)
{
    Trace.WriteLine(crt);
}

It gives this result:

'க'=>0x0b95

'ம'=>0x0bae

'ல'=>0x0bb2

'ி'=>0x0bbf

So the problem is how to keep the character 'லி' intact as 'லி', without splitting it into 'ல' and 'ி'.

In Indian languages it is natural to treat a consonant plus a vowel sign as a single character, but parsing the string in C# makes this difficult.

All I need is for the string to be split into 3 characters.

Arunkumar Chandrasekaran
  • What do you mean "how to strip character..."? Can you show what result you expect? – Alexei Levenkov Dec 20 '12 at 06:34
  • I want to get the character 'லி' as it is, 'லி', without splitting it into 'ல' and 'ி'. – Arunkumar Chandrasekaran Dec 20 '12 at 06:41
  • http://www.unicode.org/charts/PDF/U0B80.pdf Read this. The Unicode Consortium designed it this way. – Uthistran Selvaraj Dec 20 '12 at 06:51
  • I'm not asking about their design. I'm asking how to split a Unicode string into Unicode characters as they are understood in Indian languages. – Arunkumar Chandrasekaran Dec 20 '12 at 06:55
  • I see that these 2 `Char` values are rendered as one [glyph](http://en.wikipedia.org/wiki/Glyph) or [ligature](http://en.wikipedia.org/wiki/Typographic_ligature), I don't know which. But it is still unclear what you want. I suspect the answer is hidden in the description of [Char](http://msdn.microsoft.com/en-us/library/system.char(v=vs.100).aspx) and [StringInfo](http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo(v=vs.100).aspx), but you need to edit your question so it is easier to understand. – Alexei Levenkov Dec 20 '12 at 06:56

1 Answer


To iterate over graphemes you can use the methods of the StringInfo class.

Each combination of base character + combining characters is called a 'text element' by the .NET documentation, and you can iterate over them using a TextElementEnumerator:

var str = "கமலி";
var enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(str);
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current);
}

Output:

க
ம
லி
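
If the text elements are needed as a collection rather than printed one by one, the same enumerator can fill a list. The following is only a minimal sketch: the `TextElements` class and its `Split` method are hypothetical names introduced here for illustration, not part of .NET or of the original question. It relies on `TextElementEnumerator.GetTextElement()`, which returns the current element as a `string`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class, named here only for illustration.
static class TextElements
{
    // Collects each text element (base character plus any combining
    // characters) of the input into a list of strings.
    public static List<string> Split(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns the current element typed as string.
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }
}

static class Program
{
    static void Main()
    {
        List<string> parts = TextElements.Split("கமலி");

        // The string holds 4 UTF-16 code units but 3 text elements.
        Console.WriteLine("கமலி".Length); // 4
        Console.WriteLine(parts.Count);   // 3
    }
}
```

Returning strings rather than `char` values matters here: a text element such as 'லி' spans two `char` values, so it cannot be stored in a single `char`.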
porges