I'm implementing lexical parsing for the Tamil language. I need to replace a text element's value under the following condition:

        string[] ugaramStrings = { "கு", "சு", "டு", "து", "பு", "று" };
        string[] tamilvowels =
            {
                "அ",// "\u0b85"
                "ஆ",//"\u0b86"
                "இ",//"\u0b87"
                "ஈ",//"\u0b88"
                "உ",//"\u0b89"
                "ஊ",//"\u0b8A"
                "எ",// "\u0b8E"
                "ஏ",//"\u0b8F"
                "ஐ",//"\u0b90"
                "ஒ",//"\u0b92"
                "ஓ",//"\u0b93"
                "ஔ"//"\u0b94"
            };

If a word contains an element of ugaramStrings immediately followed by an element of tamilvowels, the ugaram string should be removed and the resulting word returned.

For example, அமர்ந்*துஇ*னிது should become அமர்ந்*இ*னிது, i.e. துஇ => இ.

I've done this by checking the next text element with the TextElementEnumerator class. Is there any way to do the replacement with a regular expression instead?

Arunkumar Chandrasekaran

1 Answer

Try this:

using System.Collections.Generic;
using System.Text.RegularExpressions;

string[] ugaramStrings = { "கு", "சு", "டு", "து", "பு", "று" };
string[] tamilvowels =
{
    "அ",// "\u0b85"
    "ஆ",//"\u0b86"
    "இ",//"\u0b87"
    "ஈ",//"\u0b88"
    "உ",//"\u0b89"
    "ஊ",//"\u0b8A"
    "எ",// "\u0b8E"
    "ஏ",//"\u0b8F"
    "ஐ",//"\u0b90"
    "ஒ",//"\u0b92"
    "ஓ",//"\u0b93"
    "ஔ"//"\u0b94"
};

var rxTemp = "(" +
    string.Join("|", ugaramStrings) + ")(" +
    string.Join("|", tamilvowels) + ")";

var rx = new Regex(rxTemp);

string str = "அமர்ந்*துஇ*னிது";

// This will contain all the matches
var matches = new List<Match>();

string str2 = rx.Replace(str, match => {
    matches.Add(match);
    // Groups[1] will contain the ugaram string,
    // Groups[2] will contain the Tamil vowel
    return match.Groups[2].Value;
});

This seems to work correctly: str2 will contain the replaced string, while matches will contain all the matches.

Note that ugaram characters are composed characters, so each ugaram "character" uses two C# chars.

For example, கு is 'க' + 'ு'.

This is illegal:

char ch = 'கு';

This is legal:

string str = "கு"; // str.Length == 2
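To see the composition for yourself, here is a minimal sketch that compares the number of C# chars in கு with the number of text elements reported by `StringInfo`. The class and method names (`CompositionDemo`, `CharCount`, `TextElementCount`) are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class CompositionDemo // hypothetical name, for illustration only
{
    // Number of UTF-16 code units (C# chars) in the string.
    public static int CharCount(string s) => s.Length;

    // Number of text elements (base character plus its combining marks).
    public static int TextElementCount(string s) => new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        string ugaram = "கு";
        Console.WriteLine($"chars: {CharCount(ugaram)}, text elements: {TextElementCount(ugaram)}");

        // List the individual code units: the consonant, then the vowel sign.
        foreach (char c in ugaram)
            Console.WriteLine($"U+{(int)c:X4}");
    }
}
```

For கு this should report two chars (the consonant க, U+0B95, and the vowel sign ு, U+0BC1) but a single text element.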

For this reason you can't simply use a character class such as [குசுடுதுபுறு], because a class matches one char at a time; you have to use alternation: (கு|சு|டு|து|பு|று).
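The difference can be shown with a small sketch. `Naive` uses the character class, which the regex engine reads as "any one of the chars க, ு, ச, ட, த, ப, ற". `Compact` is an assumption of mine rather than part of the answer: since all six ugaram strings end in the same vowel sign ு (U+0BC1), a consonant class followed by that sign should be equivalent to the alternation, provided the text is stored as consonant + U+0BC1 as in the arrays above. The names `UgaramPatterns`, `Naive`, `Compact` and `Collapse` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UgaramPatterns // hypothetical name, for illustration only
{
    // The twelve Tamil vowels are single chars, so a character class is safe here.
    private const string Vowels = "[அஆஇஈஉஊஎஏஐஒஓஔ]";

    // Character class over the ugaram strings: matches ONE char from the set,
    // so a bare consonant or a bare vowel sign also qualifies.
    public static readonly Regex Naive = new Regex("[குசுடுதுபுறு](" + Vowels + ")");

    // Assumed equivalent of (கு|சு|டு|து|பு|று): one consonant, then the sign U+0BC1.
    public static readonly Regex Compact = new Regex("[கசடதபற]\u0BC1(" + Vowels + ")");

    // Replaces each match with the captured vowel (group 1).
    public static string Collapse(Regex rx, string word)
    {
        return rx.Replace(word, m => m.Groups[1].Value);
    }
}
```

With the word அமர்ந்துஇனிது, `Collapse(UgaramPatterns.Compact, ...)` should drop the whole து, while `Collapse(UgaramPatterns.Naive, ...)` only removes the vowel sign ு and leaves the consonant த behind.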

xanatos
  • Very nice. I tried to do it the [] way and got unstuck as you can imagine. I was wondering if there is any way to tell regex to operate on the grapheme rather than per single byte character - i.e. treat graphemes as characters? I tried variations of String.Normalize and setting the culture on the Regex without much luck. – acarlon Sep 11 '13 at 11:35
  • @acarlon No, .NET regexes work on single `16bit char` (so sometimes half unicode char, for non-BMP characters), and don't handle directly full graphemes, so it isn't possible to do anything like that, sadly. – xanatos Sep 11 '13 at 11:38
  • Thanks, I figured as much. 'Single byte' should have been 'single char' in my comment. – acarlon Sep 11 '13 at 11:40
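Since the regex engine works per char, grapheme-aware processing has to happen outside the regex. The following is a minimal sketch of that route, in the spirit of the `TextElementEnumerator` approach mentioned in the question; it is not the asker's actual code. The names `TextElementCollapser`, `TextElements` and `Collapse` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElementCollapser // hypothetical name, for illustration only
{
    private static readonly HashSet<string> Ugaram =
        new HashSet<string> { "கு", "சு", "டு", "து", "பு", "று" };

    private static readonly HashSet<string> Vowels =
        new HashSet<string> { "அ", "ஆ", "இ", "ஈ", "உ", "ஊ", "எ", "ஏ", "ஐ", "ஒ", "ஓ", "ஔ" };

    // Splits a string into text elements (base character plus combining marks).
    public static List<string> TextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }

    // Drops every ugaram text element that is immediately followed by a vowel.
    public static string Collapse(string word)
    {
        List<string> elems = TextElements(word);
        var sb = new StringBuilder();
        for (int i = 0; i < elems.Count; i++)
        {
            bool dropped = Ugaram.Contains(elems[i])
                && i + 1 < elems.Count
                && Vowels.Contains(elems[i + 1]);
            if (!dropped)
                sb.Append(elems[i]);
        }
        return sb.ToString();
    }
}
```

Here each comparison is made on a whole text element, so the two-char composition of the ugaram strings no longer matters, at the cost of more code than the regex version.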