1

Are there good alternatives to the switch statement below that perform no worse? The real switch statement has additional sections for other non-English characters.

Note that I'd love to put multiple case statements per line, but StyleCop doesn't like it and will fail our release build as a result.

        var retVal = String.Empty;
        switch(valToCheck)
        {
            case "é": 
            case "ê": 
            case "è": 
            case "ë":
                retVal = "e";
                break;
            case "à": 
            case "â": 
            case "ä": 
            case "å":
                retVal = "a";
                break;

            default:
                retVal = "-";
                break;
        }
larryq
  • 1
    Out of curiosity, why is the default `"-"`? – TheZ Jul 25 '12 at 20:19
  • You could build a lookup table (using the most readable method, which could be slow). The lookup is very simple and very very fast. – Ben Voigt Jul 25 '12 at 20:19
  • 1
    @LittleBobbyTables: C# does not support that at all, IIRC. – SLaks Jul 25 '12 at 20:19
  • You could return the result directly instead of setting `retval` (assuming it isn't used later). – Lee Jul 25 '12 at 20:20
  • @Slaks - thanks, it's been too long – LittleBobbyTables - Au Revoir Jul 25 '12 at 20:20
  • I'd try checking the character codes, if they're contiguous it would be easy to write a conditional like `charCode > 100 && charCode < 104` – TheZ Jul 25 '12 at 20:21
  • Worse performance, but it may be easier to read if you create lists of groups of characters and then see which one contains the value. – PCasagrande Jul 25 '12 at 20:24
  • What about [this](http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net)? – DaveShaw Jul 25 '12 at 20:24
  • possible duplicate of [Ignoring accented letters in string comparison](http://stackoverflow.com/questions/359827/ignoring-accented-letters-in-string-comparison) – Tim S. Jul 25 '12 at 20:26
  • Not sure of the effect on performance (probably none in an optimized build) but there's no need to assign `var retVal = string.Empty;` -- `string retVal;` will do just fine. – phoog Jul 25 '12 at 20:27
  • similar to http://stackoverflow.com/questions/4155382/a-faster-way-of-doing-multiple-string-replacements and http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net – hatchet - done with SOverflow Jul 25 '12 at 20:27
  • @TheZ-- This is for making URLs SEO friendly. Dashes are safe characters in lieu of spaces for instance (see the URL of this page as an example.) – larryq Jul 25 '12 at 20:51

6 Answers

4

The first thing that comes to mind is a Dictionary<char,char>()
(I prefer char instead of strings because you are dealing with chars)

Dictionary<char,char> dict = new Dictionary<char,char>();
dict.Add('å', 'a');
......

then you could remove your entire switch

char retValue;
char testValue = 'å';
if(dict.TryGetValue(testValue, out retValue) == false)
   retValue = '-';
Steve
  • 1
    For performance use TryGetValue rather than ContainsKey. For compilation, use `Dictionary` rather than `Dictionary`. – phoog Jul 25 '12 at 20:26
  • `== true` is entirely redundant. – Servy Jul 25 '12 at 20:26
  • This was going to be my suggestion. @phoog I don't think TryGetValue is any more efficient - the documentation seems to suggest it internally just calls ContainsKey for you. – Tim Copenhaver Jul 25 '12 at 20:30
  • @TimCopenhaver, both methods use an internal function called FindEntry. ContainsKey returns immediately, while TryGetValue returns the element at the internal index found, or the default value for the type of the second parameter. I think TryGetValue performs better than the two steps needed otherwise: ContainsKey followed by retVal = dict[key]. – Steve Jul 25 '12 at 20:43
  • 2
    rather than having `== false` you should just use the NOT operator (`!`). – Servy Jul 25 '12 at 20:46
  • @Servy, while your argument is understandable, I prefer to use explicit boolean values in my code. I am not aware of any difference (apart from coding style preferences). – Steve Jul 25 '12 at 20:55
  • 2
    @TimCopenhaver To follow up on Steve's comment, if you call ContainsKey and then the indexer, you call FindEntry twice. It's analogous to looking in the real dictionary to see if a word is there, and, if it is, closing the dictionary, opening it to find the word again, and then reading the definition. – phoog Jul 25 '12 at 20:57
  • 1
    @Steve It's like adding `+ 0` or `* 1` to the end of a numeric calculation. It's pointless. – Servy Jul 25 '12 at 20:58
  • 2
    Also `char retValue = ' ';` could just be `char retValue;` – phoog Jul 25 '12 at 20:58
  • @Servy it's pointless unless you find `== false` easier to read than `!`. I don't imagine the optimized code is any different. – phoog Jul 25 '12 at 20:59
  • @phoog, removed the initialization of retValue as you suggested. – Steve Jul 25 '12 at 21:11
1

Well, start off by doing this transformation.

public class CharacterSanitizer
{
    private static Dictionary<string, string> characterMappings = new Dictionary<string, string>();
    static CharacterSanitizer()
    {
        characterMappings.Add("é", "e");
        characterMappings.Add("ê", "e");
        //...
    }

    public static string mapCharacter(string input)
    {
        string output;
        if (characterMappings.TryGetValue(input, out output))
        {
            return output;
        }
        else
        {
            return input;
        }
    }
}

Now you're in the position where the character mappings are part of the data, rather than the code. I've hard-coded the values here, but at this point it is simple enough to store the mappings in a file, read the file in, and populate the dictionary accordingly. This way you not only clean up the code a lot by reducing the case statement to a single text file (outside of code), but you can also modify the mappings without needing to recompile.
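As a rough sketch of the file-based idea, assuming a hypothetical UTF-8 text file with one `accented=plain` pair per line (the file format and the `MappingLoader.LoadMappings` name are illustrative assumptions, not part of the code above), loading could look something like this:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class MappingLoader
{
    // Hypothetical helper: reads lines such as "é=e" into a dictionary.
    // Blank lines and lines with no key before '=' are skipped.
    public static Dictionary<string, string> LoadMappings(string path)
    {
        var mappings = new Dictionary<string, string>();
        foreach (string line in File.ReadAllLines(path, Encoding.UTF8))
        {
            int separator = line.IndexOf('=');
            if (separator <= 0)
            {
                continue;
            }

            string key = line.Substring(0, separator).Trim();
            string value = line.Substring(separator + 1).Trim();
            mappings[key] = value; // a later line overrides an earlier duplicate
        }

        return mappings;
    }
}
```

The static constructor could then assign the result to characterMappings instead of calling Add for each pair.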

Servy
1

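You could do a small range check on the character codes (these are Unicode code points in the Latin-1 range, not ASCII values).

Assuming InRange(val, min, max) checks whether a number is in range (inclusive), and that valToCheck is a single-character string:

int code = valToCheck[0]; // Convert.ToInt32(string) would try to parse the text as a number
if(InRange(code,232,235))
  return 'e';
else if(InRange(code,224,229))
  return 'a';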

This makes the code a little confusing, and it depends on the character encoding used, but it is perhaps something to consider.
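For illustration, here is one way the range check could be spelled out. `InRange` is only assumed by the answer, so the helper below, the `CharRanges` class name, and the `Simplify` method are hypothetical; the `'-'` fallback mirrors the default case in the question.

```csharp
public static class CharRanges
{
    // Hypothetical helper: true when min <= val <= max (both bounds inclusive).
    public static bool InRange(int val, int min, int max)
    {
        return val >= min && val <= max;
    }

    // Assumes the input is a single UTF-16 code unit.
    public static char Simplify(char c)
    {
        int code = c;
        if (InRange(code, 232, 235))
        {
            return 'e'; // è é ê ë
        }

        if (InRange(code, 224, 229))
        {
            return 'a'; // à á â ã ä å
        }

        return '-';
    }
}
```

Note that the 224 to 229 range also covers á and ã, which the switch in the question does not list, so the two versions are not strictly equivalent.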

Aesthete
1

This answer presumes that you are going to apply that switch statement to a string, not just to single characters (though that would also work).

The best approach seems to be the one outlined in this StackOverflow answer.

I adapted it to use LINQ:

var chars = from character in valToCheck.Normalize(NormalizationForm.FormD)
            where CharUnicodeInfo.GetUnicodeCategory(character)
                    != UnicodeCategory.NonSpacingMark
            select character;
return string.Join("", chars).Normalize(NormalizationForm.FormC);

You'll need using directives for System.Globalization (CharUnicodeInfo, UnicodeCategory), System.Text (NormalizationForm), and System.Linq.

Sample input:

string valToCheck = "êéÈöü";

Sample output:

eeEou
Community
Adam
1

Based on Michael Kaplan's RemoveDiacritics(), you could do something like this:

static char RemoveDiacritics(char c)
{
    string stFormD = c.ToString().Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        }
    }

    return (sb.ToString()[0]);
}

switch(RemoveDiacritics(valToCheck))
{
    case 'e':
        //...
        break;
    case 'a':
        //...
        break;
        //...
}

or, potentially even:

retval = RemoveDiacritics(valToCheck);
Peter Ritchie
  • If performance is an issue, this would be slower than a dictionary approach. But, you could populate a dictionary with results from RemoveDiacritics. – Peter Ritchie Jul 25 '12 at 20:37
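A minimal sketch of that combination, under the assumption that only the characters U+00C0 through U+017F matter (the range, the `DiacriticTable` class, and the `Map` method are illustrative choices, not from the answer). The normalization cost is paid once at startup, and characters without an entry are returned unchanged here rather than replaced with `'-'`:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class DiacriticTable
{
    // Built once; afterwards each lookup is a single dictionary probe.
    private static readonly Dictionary<char, char> Table = Build();

    private static Dictionary<char, char> Build()
    {
        var table = new Dictionary<char, char>();

        // Assumed range: Latin-1 Supplement letters and Latin Extended-A.
        for (char c = '\u00C0'; c <= '\u017F'; c++)
        {
            string decomposed = c.ToString().Normalize(NormalizationForm.FormD);
            foreach (char d in decomposed)
            {
                if (CharUnicodeInfo.GetUnicodeCategory(d) != UnicodeCategory.NonSpacingMark)
                {
                    // Store only characters whose base letter differs from the original.
                    if (d != c)
                    {
                        table[c] = d;
                    }

                    break;
                }
            }
        }

        return table;
    }

    public static char Map(char c)
    {
        char result;
        return Table.TryGetValue(c, out result) ? result : c;
    }
}
```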
0

Use Contains instead of switch.

var retVal = String.Empty;

string es = "éêèë";
if (es.Contains(valToCheck)) retVal  = "e";
//etc.
ispiro
  • That will result in `"true"` or `"false"`, not converting an accent to the un-accented letter. – Servy Jul 25 '12 at 20:28
  • @Servy I just meant to suggest `Contains` instead of `switch`. I now edited my answer. – ispiro Jul 25 '12 at 20:29