How can I search a string in C# using Regex while ignoring accents?

For example, in Notepad++, searching ancient Greek text with the regex [[=α=]] returns α, ἀ, ἁ, ᾶ, ὰ, ά, ᾳ, ...

I understand that Notepad++ uses a PCRE-style regex syntax. How can I do this in C#? Is there an equivalent syntax?

Edit:

I've already tried string normalization, but it is not working for Greek. For example, "ᾶ".Normalize(NormalizationForm.FormC) returns ᾶ. It looks like normalization removes accents only for combining characters, and ᾶ is a separate (precomposed) character in Unicode!

Prodromos
  • Try [`[\p{IsGreek}\p{IsGreekExtended}]+`](http://regexstorm.net/tester?p=%5b%5cp%7bIsGreek%7d%5cp%7bIsGreekExtended%7d%5d%2b&i=%ce%b1%2c+%e1%bc%80+%e1%bc%81%2c+%e1%be%b6%2c+%e1%bd%b0%2c+%ce%ac%2c+%e1%be%b3%2c+....&o=c) – ctwheels Apr 06 '18 at 16:50
  • @ctwheels doesn't that match all Greek letters and not just the variations of "α"? – juharr Apr 06 '18 at 16:52
  • .NET regex does not support POSIX collations. Normalize string first or use character classes like `[αἀἁᾶὰάᾳ]` – Wiktor Stribiżew Apr 06 '18 at 19:09
  • I already tried normalization; please see the edited question. It looks like character classes are the only solution. Will that be efficient? Just for the letter α, the class would be [ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷ], so searching for a whole word would produce a very large regex. – Prodromos Apr 07 '18 at 05:08
  • Have you tried [normalization like this](http://archives.miloush.net/michkap/archive/2007/05/14/2629747.html)? Because when I run that on `ᾶ` I get `α`. – Rawling Apr 07 '18 at 06:06
  • @Rawling: Yes, you are right, it works. The point is that NormalizationForm.FormD converts the string to base letters plus combining characters for the accents. The result looks exactly the same as the input string, which is what confused me. The trick is to strip the combining marks after FormD and then apply NormalizationForm.FormC to recompose the string. Thank you all. – Prodromos Apr 07 '18 at 11:40
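
The confusion in the comments above comes down to what each normalization form does to a precomposed letter. Here is a minimal sketch that prints the UTF-16 code units of ᾶ under both forms. It assumes a runtime with full Unicode normalization data (the default on Windows and on ICU-backed .NET). NormalizationDemo and CodePoints are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: formats each UTF-16 code unit of a string as U+XXXX.
    public static string CodePoints(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        string input = "\u1FB6"; // ᾶ, GREEK SMALL LETTER ALPHA WITH PERISPOMENI

        // FormC keeps (or rebuilds) the precomposed letter, so the accent stays.
        Console.WriteLine(CodePoints(input.Normalize(NormalizationForm.FormC))); // U+1FB6

        // FormD splits it into the base letter plus a combining mark.
        Console.WriteLine(CodePoints(input.Normalize(NormalizationForm.FormD))); // U+03B1 U+0342
    }
}
```

Both results render identically on screen, which is why FormD appears to do nothing until the code units are inspected. Once the mark is a separate character, it can be filtered out by its Unicode category.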

2 Answers

The System.String.Normalize method still seems to be the key to solving this problem: decompose the string, drop the combining marks, then recompose it.

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Globalization;
using System.Linq;

public class Program
{
    public static void Main()
    {
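        // Every precomposed variant of Greek alpha that carries one or more diacritics.
        string rawInput = "ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷ";
        Console.WriteLine(rawInput);
        // Strip the diacritics so that a plain "α" pattern can match every variant.
        string normalizedInput = Utility.RemoveDiacritics(rawInput);
        string pattern = "α+";

        var result = Regex.Matches(normalizedInput, pattern);
        if (result.Count > 0)
            Console.WriteLine(result[0]);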
    }
}

public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        return new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
    }
}

Output:

ἀἁἂἃἄἅἆἇὰάᾀᾁᾂᾃᾄᾅᾆᾇᾰᾱᾲᾳᾴᾶᾷᾶ
αααααααααααααααααααααααααα

Original Method by Kaplan:

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();        
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }       
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
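
To search for a whole word rather than a single letter, one option is to strip the diacritics from both the text and the search term before matching, which avoids building large character classes. The sketch below assumes that approach. AccentInsensitiveSearch, Strip, and ContainsIgnoringAccents are hypothetical names, and the sample text is only an illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class AccentInsensitiveSearch
{
    // Same idea as the methods above: decompose, drop non-spacing marks, recompose.
    static string Strip(string text)
    {
        var kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }

    // Hypothetical helper: true when `word` occurs in `text`, ignoring diacritics.
    public static bool ContainsIgnoringAccents(string text, string word)
    {
        // Escape the stripped word so it is matched literally, not as a pattern.
        string pattern = Regex.Escape(Strip(word));
        return Regex.IsMatch(Strip(text), pattern);
    }

    public static void Main()
    {
        // "ἄειδε" carries a breathing mark and an accent; the search term has none.
        Console.WriteLine(ContainsIgnoringAccents("μῆνιν ἄειδε θεά", "αειδε")); // True
    }
}
```

Note that any match positions refer to the stripped string, which can differ in length from the original if the input already contained combining characters.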

PS: Unfortunately, PCRE.NET, Lucas Trzesniewski's .NET wrapper for the PCRE library, does not support (extended) POSIX collating elements.

wp78de

A few questions that have already been answered might help:

How do I remove diacritics (accents) from a string in .NET?

Regex accent insensitive?

Amber Normand
  • I've already tried string normalization, but it is not working for Greek. For example, "ᾶ".Normalize(NormalizationForm.FormC) returns ᾶ. It looks like normalization removes accents only for combining characters. I also checked the second link, but it does not seem relevant to me, since it solves a case-matching problem. – Prodromos Apr 07 '18 at 04:58