13

I've found an answer on Stack Overflow about how to remove diacritic characters, but could you please tell me whether it is possible to change diacritic characters to non-diacritic ones?

Oh, and I'm thinking of .NET (or another platform if that's not possible).

BenMorel
Tom Smykowski
  • When I had to do this in perl I just had a big long hand-maintained "tr" statement, so good luck. – Paul Tomblin Dec 01 '08 at 16:16
  • this is a duplicate of _several_ questions. search for "translit", for example. please don't butcher our languages! –  Dec 01 '08 at 16:27

5 Answers

29

Since no one has ever bothered to post the code to do this, here it is:

    using System.Text;
    using System.Text.RegularExpressions;

    // \p{Mn} matches the Unicode category "Mark, Nonspacing":
    //   a character intended to be combined with another
    //   character without taking up extra space
    //   (e.g. accents, umlauts, etc.).
    private static readonly Regex nonSpacingMarkRegex =
        new Regex(@"\p{Mn}", RegexOptions.Compiled);

    public static string RemoveDiacritics(string text)
    {
        if (text == null)
            return string.Empty;

        // Form D splits each accented letter into base letter + combining marks.
        var normalizedText =
            text.Normalize(NormalizationForm.FormD);

        return nonSpacingMarkRegex.Replace(normalizedText, string.Empty);
    }

Note: a big reason for needing this is integrating with a third-party system that only handles ASCII while your data is in Unicode. This is common. Your options are basically to drop the accented characters entirely, or to strip only the accents from them so that as much of the original input as possible is preserved. Obviously this is not a perfect solution, but it is far better than simply removing every character above ASCII 127.
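To make the ASCII-only scenario concrete, here is a minimal sketch that strips accents the same way and then replaces whatever is still outside ASCII with `?`. The class name `AsciiExport`, the method name `ToAscii`, and the choice of `?` as a placeholder are assumptions for illustration, not part of the answer above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class AsciiExport
{
    private static readonly Regex NonSpacingMark =
        new Regex(@"\p{Mn}", RegexOptions.Compiled);

    // Hypothetical helper: strip accents first, then replace any
    // character that is still outside ASCII with '?'.
    public static string ToAscii(string text)
    {
        if (text == null)
            return string.Empty;

        string stripped = NonSpacingMark.Replace(
            text.Normalize(NormalizationForm.FormD), string.Empty);

        var sb = new StringBuilder(stripped.Length);
        foreach (char c in stripped)
            sb.Append(c <= 127 ? c : '?');
        return sb.ToString();
    }
}
```

With this sketch, `AsciiExport.ToAscii("Cr\u00E8me br\u00FBl\u00E9e")` gives `"Creme brulee"`, while `AsciiExport.ToAscii("\u0141\u00F3d\u017A")` gives `"?odz"`, because `Ł` has no canonical decomposition and so is not touched by the accent-stripping step.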

Diadistis
dan
11

Copying from my own answer to another question:

Instead of creating your own table, you could convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
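As a rough sketch of the difference between the two forms, the hypothetical helper below (`KeepAscii` is an invented name) decomposes the text with the requested form and keeps only ASCII characters. It keeps all of ASCII, including spaces and digits, rather than letters only, which is an assumption about what is usually wanted.

```csharp
using System;
using System.Text;

public static class NormalizationForms
{
    // Hypothetical helper: decompose with the given form, then keep
    // only characters in the ASCII range (0..127).
    public static string KeepAscii(string text, NormalizationForm form)
    {
        string decomposed = text.Normalize(form);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (c <= 127)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

For the input `"\uFB01anc\u00E9"` (the "fi" ligature followed by "ancé"), `FormD` yields `"ance"` because the ligature has only a compatibility decomposition and is dropped, whereas `FormKD` yields `"fiance"`.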

Community
CesarB
  • 9
    please don't do this, if possible. you are butchering our languages. try to use transliteration –  Dec 01 '08 at 16:24
  • @hop, there are many valid reasons to do this (generating normalized n-grams for lexical analysis for example) – Diadistis Jun 26 '11 at 17:49
  • @Diadistis: a) i don't think proper transliteration hinders that kind of analysis and b) "many valid reasons"? name a few… –  Jun 26 '11 at 23:47
  • 4
    @hop, here's an example: I have a system where users search for content, and most users type their queries without accents, causing a mismatch between the content in the index and the query. – Amit Bens Oct 18 '11 at 09:18
  • @Amit: then you are doing search wrong. –  Oct 18 '11 at 22:51
  • 3
    @hop your understanding of search is wrong – maxbeaudoin Jun 06 '12 at 13:51
  • I agree with Amit that search is a valid use for this. I'm having problems with search performance in MySQL because I'm using the "NOCASE" option on my table. My solution is to normalize the stored strings and the query strings to plain lowercase ASCII so that no transformations are needed in the search engine. – John Stephen Jun 28 '12 at 19:35
  • 2
    @hop: ["This is America, Take Your Unicode Somewhere Else."](http://web.archive.org/web/20110520135634/http://teddziuba.com/2009/07/this-is-america-take-your-unic.html) Millions of users don't know or care how to type diacritics. If they can't find what they're looking for, the way they want to look for it, they will blame your application. – Iain Samuel McLean Elder Jul 03 '12 at 13:35
  • @isme: if you do search in natural language text by character-by-character comparison, you are doing it wrong(r) anyway. Also, nice, intelligent argument there. –  Jul 03 '12 at 16:50
  • 1
    @hop It's a troll of a title, I know, but an instructive story. As you say, not all transliterations are character-by-character. For example, in German, the names 'Düsseldorf' and 'Duesseldorf' are equivalent; the second is acceptable in URLs and e-mail addresses. But an untrained English user, ignorant of German orthography, will type 'Dusseldorf' and expect to find the same results. Google, for example, knows this, and treats all three as synonyms. – Iain Samuel McLean Elder Jul 03 '12 at 20:15
  • @isme: Google treats far more than those three as synonyms, that's my whole point! It recognizes "duseldorf" and "duseldorff" as well, for example. Because it exactly does not "strip those silly dots that only confuse merkins and compare char-by-char" :^) –  Jul 06 '12 at 09:15
  • @hop: I'm curious. What are you asking for? By transliteration, do you mean that people should be able to enter only 'Düsseldorf' and 'Duesseldorf', or is 'Dusseldorf' also acceptable? This is leaving aside the issue of dealing with typos and misspellings, such as transposing letters ('Dussledorf'), doubling letters that shouldn't be ('Duseldorff'), or not doubling letters that should be ('Duseldorf'). – Simon Elms Jul 15 '12 at 22:48
4

It might also be worthwhile to step back and consider why you want to do this. If you are trying to remove character differences you consider insignificant, you should look at the Unicode collation algorithm. This is the standard way to disregard differences such as case or diacritics when comparing strings for searching or sorting.
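In .NET, culture-sensitive comparison through `CompareInfo` plays a similar role: the strings are left untouched and the comparison itself is told to ignore accents and case. The following is a minimal sketch; the class and method names are invented for illustration, and the result depends on the culture data available on the machine.

```csharp
using System;
using System.Globalization;

public static class AccentInsensitiveCompare
{
    private const CompareOptions Options =
        CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;

    // Hypothetical helper: true if the strings differ only by
    // case and/or combining marks under the given culture.
    public static bool AreEquivalent(string a, string b, CultureInfo culture)
    {
        return culture.CompareInfo.Compare(a, b, Options) == 0;
    }

    // Hypothetical helper: accent- and case-insensitive substring search.
    public static bool Contains(string haystack, string needle, CultureInfo culture)
    {
        return culture.CompareInfo.IndexOf(haystack, needle, Options) >= 0;
    }
}
```

For example, `AccentInsensitiveCompare.AreEquivalent("r\u00E9sum\u00E9", "Resume", CultureInfo.InvariantCulture)` should report a match without either string being modified.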

If you plan to display the modified text, consider your audience. What you can safely filter away is locale sensitive. In US English, "Igloo" = "igloo", and "resume" = "résumé", but in Turkish, a lower case I is ı (dotless), and in French, cote means quote, côté means side, and côte means coast. So, the collation language determines what differences are significant.

If removing diacritics is the right solution for your application, it is safest to produce your own table to which you explicitly add the characters you want to convert.
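A hand-maintained table could look roughly like the sketch below. The entries and the names `ExplicitFolding` and `Fold` are illustrative assumptions; a real table would list exactly the characters your application needs. One advantage is that it can express multi-letter replacements such as "ü" to "ue".

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ExplicitFolding
{
    // Illustrative entries only; extend with the characters you care about.
    private static readonly Dictionary<char, string> Replacements =
        new Dictionary<char, string>
        {
            { '\u00E4', "ae" }, // ä
            { '\u00F6', "oe" }, // ö
            { '\u00FC', "ue" }, // ü
            { '\u00DF', "ss" }, // ß
            { '\u00E9', "e" },  // é
            { '\u00F8', "o" },  // ø
        };

    public static string Fold(string text)
    {
        // Compose first so that "u" + combining diaeresis is seen as "ü".
        string composed = text.Normalize(NormalizationForm.FormC);
        var sb = new StringBuilder(composed.Length);
        foreach (char c in composed)
        {
            if (Replacements.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c); // anything not in the table is left unchanged
        }
        return sb.ToString();
    }
}
```

With these entries, `ExplicitFolding.Fold("D\u00FCsseldorf")` returns `"Duesseldorf"`, and characters that are not in the table pass through untouched.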

A general, automated approach could be devised using Unicode decomposition. With this, you can decompose a character with diacritics into "combining" characters (the diacritic marks) and the base character with which they are combined. Filter out anything that is a combining character, and you should be left with the "non-diacritic" ones.
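The decomposition approach can be sketched in C# with `CharUnicodeInfo.GetUnicodeCategory`. The names below are invented for illustration, and treating all three mark categories as "combining" is an assumption; it is deliberately indiscriminate, so it will also remove marks that are essential in some scripts.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CombiningMarkFilter
{
    public static string StripCombiningMarks(string text)
    {
        // Canonical decomposition: base character followed by its marks.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        // Note: this iterates UTF-16 code units, so marks outside the
        // Basic Multilingual Plane (surrogate pairs) are not detected.
        foreach (char c in decomposed)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            bool isCombining =
                category == UnicodeCategory.NonSpacingMark ||
                category == UnicodeCategory.SpacingCombiningMark ||
                category == UnicodeCategory.EnclosingMark;
            if (!isCombining)
                sb.Append(c);
        }

        // Recompose whatever is left into the usual composed form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

For instance, `CombiningMarkFilter.StripCombiningMarks("c\u00F4t\u00E9")` returns `"cote"`, which also illustrates the loss of meaning discussed above.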

The lack of discrimination in the automated method, however, could have some unexpected effects. I'd recommend a lot of testing on a representative body of text.
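One simple way to start that testing is a small smoke test that runs sample strings through the stripping step and reports whether the result is plain ASCII. The sketch below uses invented names and a handful of sample words chosen as assumptions; letters such as "ø", "ß" and "Ł" have no decomposition, so they come through unchanged and get flagged.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class StripSmokeTest
{
    public static string Strip(string s)
    {
        return Regex.Replace(
            s.Normalize(NormalizationForm.FormD), @"\p{Mn}", string.Empty);
    }

    public static bool IsAscii(string s)
    {
        foreach (char c in s)
        {
            if (c > 127)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        string[] samples =
        {
            "r\u00E9sum\u00E9",         // résumé
            "Sm\u00F8rrebr\u00F8d",     // Smørrebrød
            "Stra\u00DFe",              // Straße
            "\u0141\u00F3d\u017A",      // Łódź
        };

        foreach (string sample in samples)
        {
            string stripped = Strip(sample);
            Console.WriteLine(
                $"{sample} -> {stripped} (ASCII only: {IsAscii(stripped)})");
        }
    }
}
```

Only the first sample ends up as pure ASCII here; the others show the kind of leftover that a representative test corpus should surface before you rely on the automated method.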

erickson
  • 2
    I think one of uses of this is to create nice URLs – Tom Smykowski Dec 01 '08 at 21:57
  • Absolutely. If you have a product named "Rändi Fay_Female Vocalist" and you need to generate a url stub /product/something, your choices are essentially to replace the accented a with an unaccented one, or to URL-escape the string leaving an ugly percent in there. The unaccented a is far preferable. URLs are machine-readable strings but it's often important that they be at least semi-human-readable. – Ross Presser Mar 04 '16 at 16:23
2

For a simple example:

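To split the diacritics off the letters in a string (they remain in the result as separate combining characters, which still have to be filtered out to actually remove them):

    string newString = myDiacriticsString.Normalize(NormalizationForm.FormD);
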
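A small sketch can show what the normalized string actually contains. The class and method names are invented for illustration. The result usually renders the same on screen, so listing the code points is the easiest way to see the change; note also that `Normalize` returns a new string rather than modifying the original.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Hypothetical helper: list the UTF-16 code units of a string,
    // e.g. "U+0065 U+0301".
    public static string CodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string composed = "\u00E9"; // é as a single character
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(CodeUnits(composed));   // U+00E9
        Console.WriteLine(CodeUnits(decomposed)); // U+0065 U+0301
        Console.WriteLine(decomposed.Length);     // 2
    }
}
```

The accent is still present as U+0301 after normalization, so a second step that filters out combining marks is needed to end up with a plain "e".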
Chris James
  • 4
    does not work : "ě".Normalize(NormalizationForm.FormD) does not return "e" – Feryt May 19 '10 at 11:05
  • Yes it does, use String.ToCharArray() to see it. – Hans Passant Jun 07 '10 at 19:34
  • Just like Feryt it doesn't work for me. ("xxé").Normalize(NormalizationForm.FormD) returns "xxe" (like expected), but string v = "xxé"; v.Normalize(NormalizationForm.FormD); returns "xxé". I tried to call v.ToCharArray() and ("xxé").ToCharArray() to see if there is any difference, they return the same array. Very strange ! – AFract Jul 10 '13 at 12:22
  • That is not the whole story. NormalizationForm.FormD will remove the accent but it adds the accent as a separate char. Check the length of the ToCharArray. – paparazzo Oct 10 '13 at 16:24
0

My site ingests data from external sources that contain many strange characters. I wrote the following C# function to replace accented characters and strip out non-US keyboard characters using Regex:

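    using System.Text;
    using System.Text.RegularExpressions;

    internal static string SanitizeString(string source)
    {
        // Decompose accented letters, then drop everything that is not a
        // whitelisted US-keyboard character (including the detached accents).
        // The hyphen is escaped so it is a literal, not a range operator.
        return Regex.Replace(
            source.Normalize(NormalizationForm.FormD),
            @"[^A-Za-z0-9 .,?'""!@#$%^&*()\-_=+;:<>/\\|}{\[\]`~]+",
            string.Empty).Trim();
    }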

Hope it helps.

Jon B