Regular Expressions in C# for Character Equivalents

Question

How can I search a string in C# using Regex while ignoring accents?

For example, in Notepad++, for ancient Greek, searching with the regex [[=α=]] returns: α, ἀ, ἁ, ᾶ, ὰ, ά, ᾳ, ...

I know Notepad++ uses the PCRE standard. How can I do this in C#? Is there an equivalent syntax?

Edit:

I've already tried string normalization. It does not work for Greek. For example, "ᾶ".Normalize(NormalizationForm.FormC) returns ᾶ. It looks like normalization removes accents only in the case of combining characters. The ᾶ character is a separate character in Unicode!

Try [`[\p{IsGreek}\p{IsGreekExtended}]+`](http://regexstorm.net/tester?p=%5b%5cp%7bIsGreek%7d%5cp%7bIsGreekExtended%7d%5d%2b&i=%ce%b1%2c+%e1%bc%80+%e1%bc%81%2c+%e1%be%b6%2c+%e1%bd%b0%2c+%ce%ac%2c+%e1%be%b3%2c+....&o=c) — ctwheels, Apr 06 '18 at 16:50
@ctwheels doesn't that match all Greek letters and not just the variation of "a"? — juharr, Apr 06 '18 at 16:52
.NET regex does not support POSIX collations. Normalize string first or use character classes like `[αἀἁᾶὰάᾳ]` — Wiktor Stribiżew, Apr 06 '18 at 19:09
Already tried normalization; please see the edited question. It looks like character classes are the only solution. Will that be efficient? Just for the letter α, the class will be [ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷ]. Searching for a word will produce a very large regex. — Prodromos, Apr 07 '18 at 05:08
Have you tried [normalization like this](http://archives.miloush.net/michkap/archive/2007/05/14/2629747.html)? Because when I run that on `ᾶ` I get `α`. — Rawling, Apr 07 '18 at 06:06
Rawling: Yes, you are right, it works. The point is that NormalizationForm.FormD converts the accents to combining characters. The visual representation is exactly the same as the input string, so I was confused. The trick is to strip the combining marks after this and then apply NormalizationForm.FormC. Thank you all. — Prodromos, Apr 07 '18 at 11:40
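
The confusion described in the comments above becomes visible when code points are printed instead of glyphs. The following is a minimal sketch; the class name NormalizationDemo and the helper CodePoints are illustrative and not part of the original thread.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Formats each UTF-16 code unit as U+XXXX so that differences
    // hidden by identical rendering become visible.
    public static string CodePoints(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    public static void Main()
    {
        string precomposed = "\u1FB6"; // ᾶ GREEK SMALL LETTER ALPHA WITH PERISPOMENI

        string formC = precomposed.Normalize(NormalizationForm.FormC);
        string formD = precomposed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(CodePoints(formC)); // U+1FB6
        Console.WriteLine(CodePoints(formD)); // U+03B1 U+0342

        // The second code unit of the decomposed form is the combining accent.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(formD[1])); // NonSpacingMark
    }
}
```

FormC leaves the precomposed letter untouched, while FormD yields the plain α followed by a combining perispomeni. Both strings render identically, which is why the FormD result looks unchanged on screen.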

score 3 · Accepted Answer · answered Apr 07 '18 at 07:00

The System.String.Normalize method still seems to be the key to solving this problem.

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Globalization;
using System.Linq;

public class Program
{
    public static void Main()
    {
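        // Precomposed lowercase alpha variants (same string as the first output line).
        string rawInput = "ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷᾶ";
        Console.WriteLine(rawInput);
        // Strip the accents first, then match against the plain base letter.
        string normalizedInput = Utility.RemoveDiacritics(rawInput);
        string pattern = "α+";

        var result = Regex.Matches(normalizedInput, pattern);
        if (result.Count > 0)
            Console.WriteLine(result[0]);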
    }
}

public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
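        if (str == null) return null;
        // FormD splits each precomposed letter into its base letter plus combining marks.
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark // drop the accents
            select c;

        // FormC recomposes whatever is left.
        return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);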
    }
}

Output:

ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷᾶ
αααααααααααααααααααααααααα

Original Method by Kaplan:

// Requires: using System.Globalization; using System.Text;
static string RemoveDiacritics(string text)
{
    // Decompose so that each accent becomes a separate combining character.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        // Keep everything except the combining accents.
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
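
To search for a whole word rather than a single letter, one option is to strip the accents from both the text and the query before matching. The sketch below assumes that approach; the class AccentInsensitiveSearch and its Contains helper are hypothetical names, and Strip is a LINQ restatement of the RemoveDiacritics method shown above.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class AccentInsensitiveSearch
{
    // Decompose, drop non-spacing marks, recompose.
    public static string Strip(string text) =>
        new string(text.Normalize(NormalizationForm.FormD)
                       .Where(c => CharUnicodeInfo.GetUnicodeCategory(c)
                                   != UnicodeCategory.NonSpacingMark)
                       .ToArray())
            .Normalize(NormalizationForm.FormC);

    // Hypothetical helper: true if 'query' occurs in 'text' once accents
    // are ignored on both sides. Regex.Escape keeps the query literal.
    public static bool Contains(string text, string query) =>
        Regex.IsMatch(Strip(text), Regex.Escape(Strip(query)));

    public static void Main()
    {
        Console.WriteLine(Contains("ὁ ἄνθρωπος", "ανθρωπος")); // True
        Console.WriteLine(Contains("ὁ ἄνθρωπος", "λόγος"));    // False
    }
}
```

Match positions refer to the stripped string. If the input already contains decomposed characters, the stripped string is shorter than the original, so indices would need to be mapped back before being used on the raw text.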

References:

Michael S. Kaplan: FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)
Michael S. Kaplan: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)
Code adapted from (see also): How do I remove diacritics (accents) from a string in .NET?

PS: Unfortunately, PCRE.NET, Lucas Trzesniewski's .NET wrapper for the PCRE library, does not support (extended) POSIX collating elements.
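
Since `[[=α=]]` is not available, the character-class approach from the comments can be automated instead of typed by hand. The following is only a sketch under stated assumptions: it scans the Greek and Coptic block (U+0370 to U+03FF) and the Greek Extended block (U+1F00 to U+1FFF), and the names EquivalenceClass, For and PatternFor are invented for illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class EquivalenceClass
{
    // Decompose, drop non-spacing marks, recompose.
    static string Strip(string text) =>
        new string(text.Normalize(NormalizationForm.FormD)
                       .Where(c => CharUnicodeInfo.GetUnicodeCategory(c)
                                   != UnicodeCategory.NonSpacingMark)
                       .ToArray())
            .Normalize(NormalizationForm.FormC);

    // Builds a class like [αά...] containing every letter in the two
    // Greek blocks that reduces to 'baseLetter' once accents are stripped.
    public static string For(char baseLetter)
    {
        var members = Enumerable.Range(0x0370, 0x90)        // Greek and Coptic
            .Concat(Enumerable.Range(0x1F00, 0x100))        // Greek Extended
            .Select(cp => (char)cp)
            // IsLetter is checked first so unassigned code points are never normalized.
            .Where(c => char.IsLetter(c)
                        && Strip(c.ToString()) == baseLetter.ToString());
        return "[" + string.Concat(members) + "]";
    }

    // Replaces each letter of the query with its equivalence class.
    public static string PatternFor(string query) =>
        string.Concat(Strip(query).Select(c =>
            char.IsLetter(c) ? For(c) : Regex.Escape(c.ToString())));

    public static void Main()
    {
        string pattern = PatternFor("ανθρωπος");
        // "\u1F04" is the precomposed ἄ (alpha with psili and oxia).
        Console.WriteLine(Regex.IsMatch("\u1F04νθρωπος", pattern)); // True
        Console.WriteLine(Regex.IsMatch("λόγος", pattern));          // False
    }
}
```

Unlike the stripping approach, this matches against the unmodified text, so match positions are valid in the original string. It assumes the text is in precomposed form (FormC) and is case-sensitive as written.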

Excellent solution! Yes, I've already tested PCRE.NET and it does not support POSIX collating elements. — Prodromos, Apr 07 '18 at 11:12

score 0 · Answer 2 · answered Apr 06 '18 at 18:09

A few already-answered questions might help:

How do I remove diacritics (accents) from a string in .NET?

Regex accent insensitive?

Amber Normand

I've already tried string normalization. It does not work for Greek. For example, "ᾶ".Normalize(NormalizationForm.FormC) returns ᾶ. It looks like normalization removes accents only in the case of combining characters. I also checked the second link. It does not look relevant to me, since it solves a case-matching problem! – Prodromos Apr 07 '18 at 04:58