7

I am developing a heuristic for automatic language detection and would like to find out whether a given letter has a diacritic (as in "Ðàäèî Êóëüòóðà", where all letters have diacritics). Ideally, I would also like to get the type of the diacritic.

I browsed through the `UnicodeCategory` enum but didn't find anything that could help me here.

Alexander Galkin
  • The letter eth (Ð) has no diacritic. In Unicode, it is a basic character; the stroke is not regarded as a diacritic. You may thus wish to reformulate your goal (and possibly explain what specific problem it would solve, as there might be better approaches). – Jukka K. Korpela Feb 19 '12 at 14:11
  • 2
    Decomposing is the last thing you want to do. The combination of a specific letter with a specific diacritic is a strong selector for the language. Just build the frequency tables up front. But there are lots of languages that use next to no diacritics. You won't be able to tell the difference between English, Dutch and Italian for example. You'll need a dictionary to make it really work. Storing, say, the 100 most common words will go a long way. – Hans Passant Feb 19 '12 at 14:13

2 Answers

16

One possible way is to normalize the character to a form in which a letter and its diacritics are written as separate codepoints, and then check whether you have a letter followed by combining marks.

Adapting from "How do I remove diacritics (accents) from a string in .NET?", you can normalize with `Normalize(NormalizationForm.FormD)` and check for the diacritics with `UnicodeCategory.NonSpacingMark`.

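using System.Globalization; // CharUnicodeInfo, UnicodeCategory
using System.Linq;          // Skip, All
using System.Text;          // NormalizationForm

bool IsLetterWithDiacritics(char c)
{
    // FormD splits a precomposed character into its base character plus combining marks.
    var s = c.ToString().Normalize(NormalizationForm.FormD);
    return (s.Length > 1)  &&
           char.IsLetter(s[0]) &&
           s.Skip(1).All(c2 => CharUnicodeInfo.GetUnicodeCategory(c2) == UnicodeCategory.NonSpacingMark);
}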
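The question also asks for the type of diacritic. As a rough sketch of how the same decomposition could provide that: the codepoints that follow the base letter in FormD are the combining marks themselves, so returning them identifies the diacritic (for example U+0301 is the combining acute accent). The class name `DiacriticInspector` and the method `GetCombiningMarks` are hypothetical names chosen for this illustration, and the surrogate guard is an added assumption, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only.
public static class DiacriticInspector
{
    // Returns the combining marks that follow the base letter in the
    // FormD decomposition of c. Returns an empty list if c is not a
    // letter followed only by non-spacing marks.
    public static IReadOnlyList<char> GetCombiningMarks(char c)
    {
        // A lone surrogate is not a valid string and cannot be normalized.
        if (char.IsSurrogate(c))
            return Array.Empty<char>();

        string s = c.ToString().Normalize(NormalizationForm.FormD);
        if (s.Length < 2 || !char.IsLetter(s[0]))
            return Array.Empty<char>();

        char[] marks = s.Skip(1).ToArray();
        bool allNonSpacing = marks.All(m =>
            CharUnicodeInfo.GetUnicodeCategory(m) == UnicodeCategory.NonSpacingMark);

        return allNonSpacing ? marks : Array.Empty<char>();
    }
}
```

For `'\u00E9'` (é) this yields the single mark U+0301, while for `'\u00D0'` (Ð), which has no decomposition, it yields an empty list. Mapping each mark to a readable name (acute, grave, diaeresis and so on) would need a lookup table of your own.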
CodesInChaos
  • 3
    If you need a true/false check, you could just normalize it to FormD or whatever it would be, and just check if the string is longer than the original. – Joakim Johansson Feb 19 '12 at 13:38
  • 1
    @JoakimJohansson I wouldn't be surprised if there were other glyphs that decompose in FormD, but aren't accented letters. But I don't know how well my idea would behave on those either. – CodesInChaos Feb 19 '12 at 13:39
  • 2
    @JoakimJohansson One big class of characters that your algorithm considers as having diacritics is Korean Hangul. These characters consist of several parts, which get decomposed, but are not diacritics. Some examples: `가`, `간`, `갂`. Then there are mathematical symbols such as `≠`, `⊉`, `∄`, `∦`. And finally a few that I don't know at all: `ஔ` – CodesInChaos Feb 19 '12 at 13:55
  • Diacritics are only a subset of non-spacing combining characters. For example, the Unicode character [`"\u0CBF"`](http://www.fileformat.info/info/unicode/char/0cbf/index.htm) is `UnicodeCategory.NonSpacingMark`, but it's not a diacritic. – Paolo Moretti Feb 19 '12 at 14:36
  • @PaoloMoretti The input must be the decomposition of a single codepoint, and the first codepoint in the decomposition must be a letter. So your example doesn't fail the algorithm. I'm not sure if there are any false positives. There are a few strange cases in Arabic script, but I don't know whether those are diacritics or not. But if you know a better algorithm, I'm open to suggestions. – CodesInChaos Feb 19 '12 at 14:42
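The difference discussed in these comments can be sketched side by side: a pure length check after FormD normalization versus the letter-plus-non-spacing-marks check from the answer. The class name `DecompositionChecks` and the method `DecomposesInFormD` are hypothetical names for this illustration; `IsLetterWithDiacritics` repeats the logic from the answer so that the block is self-contained.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only.
public static class DecompositionChecks
{
    // Length-only check: true for anything that decomposes in FormD.
    public static bool DecomposesInFormD(char c) =>
        c.ToString().Normalize(NormalizationForm.FormD).Length > 1;

    // Stricter check: a letter followed only by non-spacing marks.
    public static bool IsLetterWithDiacritics(char c)
    {
        string s = c.ToString().Normalize(NormalizationForm.FormD);
        return s.Length > 1
            && char.IsLetter(s[0])
            && s.Skip(1).All(c2 =>
                CharUnicodeInfo.GetUnicodeCategory(c2) == UnicodeCategory.NonSpacingMark);
    }
}
```

With standard Unicode normalization data, `'\uAC00'` (가) and `'\u2260'` (≠) should both pass the length-only check but fail the stricter one: the Hangul syllable decomposes into jamo that are letters rather than non-spacing marks, and ≠ decomposes into `=`, which is not a letter. `'\u00E9'` (é) should pass both.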
0

Try this:


public bool CheckIsStringContainDiacriticsCharacter(string text)
{
    // Decompose the text so that each diacritic becomes a separate combining character.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    foreach (var c in normalizedString)
    {
        // A non-spacing mark is a combining character, typically a diacritic.
        if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
        {
            return true;
        }
    }

    return false;
}
Ashish Thakur