Low complexity algorithm to remove/replace special characters

Question

I want to replace some invalid characters in the name of a file uploaded to my application.

I searched the internet and found some fairly complex algorithms for this. Here's one:

        public static string RemoverAcentuacao(string palavra)
        {
            string palavraSemAcento = null;
            string caracterComAcento = "áàãâäéèêëíìîïóòõôöúùûüçáàãâÄéèêëíìîïóòõÖôúùûÜç, ?&:/!;ºª%‘’()\"”“";
            string caracterSemAcento = "aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC___________________";

            if (!String.IsNullOrEmpty(palavra))
            {
                for (int i = 0; i < palavra.Length; i++)
                {
                    if (caracterComAcento.IndexOf(Convert.ToChar(palavra.Substring(i, 1))) >= 0)
                    {
                        int car = caracterComAcento.IndexOf(Convert.ToChar(palavra.Substring(i, 1)));
                        palavraSemAcento += caracterSemAcento.Substring(car, 1);
                    }
                    else
                    {
                        palavraSemAcento += palavra.Substring(i, 1);
                    }
                }

                string[] cEspeciais = { "#39", "---", "--", "'", "#", "\r\n", "\n", "\r" };

                for (int q = 0; q < cEspeciais.Length; q++)
                {
                    palavraSemAcento = palavraSemAcento.Replace(cEspeciais[q], "-");
                }

                for (int x = (cEspeciais.Length - 1); x > -1; x--)
                {
                    palavraSemAcento = palavraSemAcento.Replace(cEspeciais[x], "-");
                }

                palavraSemAcento = palavraSemAcento.Replace("+", "-").Replace(Environment.NewLine, "").TrimStart('-').TrimEnd('-').Replace("<i>", "-").Replace("<-i>", "-").Replace("<br>", "").Replace("--", "-");
            }
            else
            {
                palavraSemAcento = "indefinido";
            }

            return palavraSemAcento.ToLower();
        }

Is there a way to do it with a less complex algorithm?

This algorithm seems very complex for a task that is not, but I can't think of a different approach.

That code is doing more than just removing all instances of characters from a set list of chars. If you *need* that more complex logic, then there's a lot less choice in the matter. If you only need to remove all instances of certain characters, it's a lot easier than that code. — Servy, Aug 14 '13 at 19:06
@Juhana Because I think it's always good to write less complex algorithms, and to improve or reduce something big. — Wellington Zanelli, Aug 14 '13 at 19:06
If it works and you just want a nicer solution, you should post that on the [code review stack exchange website](http://codereview.stackexchange.com/), not here. — Pierre-Luc Pineault, Aug 14 '13 at 19:07
Are you talking about readability, or about runtime-complexity? The fact that you're using string concatenation makes this *much* slower than it should be. — Jon Skeet, Aug 14 '13 at 19:07
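
To illustrate the string-concatenation point: every `+=` on a `string` allocates a new string and copies everything built so far, so the loop in the question does work that grows quadratically with the input length. Below is a minimal sketch with `StringBuilder`, assuming a hypothetical `Dictionary<char, char>` in place of the two parallel strings. The class name, method name, and the handful of map entries are illustrative only and are not from the question.

```csharp
using System.Collections.Generic;
using System.Text;

public static class NameCleaner // hypothetical name
{
    // Illustrative subset of the mapping; a real table would list every character.
    private static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { 'á', 'a' }, { 'ã', 'a' }, { 'é', 'e' }, { 'ç', 'c' },
        { ' ', '_' }, { '?', '_' }, { '&', '_' }
    };

    public static string ReplaceMapped(string input)
    {
        if (string.IsNullOrEmpty(input))
            return "indefinido"; // same fallback as the question's code

        // One buffer that is appended to in place, instead of a new string per character.
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            char mapped;
            sb.Append(Map.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString().ToLower();
    }
}
```

The dictionary lookup also avoids calling `IndexOf` twice per character, which the original loop does.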

I4V · Answer 1 · 2013-08-14T19:53:26.173

I want to replace some invalid characters in the name of a file

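If this is really what you want, then it is easy:

// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
string ToLegalFileName(string s)
{
    // The set of invalid file name characters depends on the platform.
    var invalidChars = new HashSet<char>(Path.GetInvalidFileNameChars());
    return String.Join("", s.Select(c => invalidChars.Contains(c) ? '_' : c));
}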

If your intent is to replace accented chars with their ASCII counterparts, then:

string RemoverAcentuacao(string s)
{
    return String.Join("",
            s.Normalize(NormalizationForm.FormD)
            .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
}

And this is the third version, which replaces accented chars and also replaces any other non-alphanumeric or non-ASCII char with '_':

string RemoverAcentuacao2(string s)
{
    return String.Join("",
            s.Normalize(NormalizationForm.FormD)
            .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .Select(c => char.IsLetterOrDigit(c) ? c : '_')
            .Select(c => (int)c < 128 ? c : '_'));
}
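
Why the `Normalize` calls work: `NormalizationForm.FormD` splits a precomposed character such as `é` into a base letter followed by a combining mark, and the `Where` clause then drops the marks. The helper below is a hypothetical inspection aid (not part of the answer) that prints each UTF-16 code unit of the decomposed string with its Unicode category, assuming the runtime has full normalization support.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

public static class NormalizeDemo // hypothetical name
{
    // Lists each UTF-16 code unit of the decomposed string with its category.
    public static string Describe(string s)
    {
        return string.Join(" ",
            s.Normalize(NormalizationForm.FormD)
             .Select(c => string.Format("U+{0:X4}({1})", (int)c, char.GetUnicodeCategory(c))));
    }
}
```

For example, `NormalizeDemo.Describe("\u00E9")` yields `U+0065(LowercaseLetter) U+0301(NonSpacingMark)`. Letters that have no canonical decomposition, such as `ø`, pass through the second version unchanged, which is why the third version adds the `< 128` check.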

I liked your previous version ;-) Actually I now use what you proposed in a similar form. It makes filenames correct but changes them a bit too much. Just turning the accented letters into their unaccented forms is maybe smoother and more in line with the question here. It would be cool to find a simple way to change accents. There is at least the possibility to sort without considering accents, so maybe such a transformation is also possible. — citykid, Aug 14 '13 at 19:52
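
On the "sort without considering accents" remark: .NET exposes this through `CompareOptions.IgnoreNonSpace`. It compares strings rather than transforming them, so it complements the `Normalize` approach instead of replacing it. A minimal sketch, assuming the invariant culture is acceptable and that the runtime has full globalization data (the class and method names are hypothetical):

```csharp
using System.Globalization;

public static class AccentCompare // hypothetical name
{
    // True when the two strings differ only by combining marks such as accents.
    // Case is still significant; add CompareOptions.IgnoreCase to relax that.
    public static bool EqualsIgnoringAccents(string a, string b)
    {
        return CultureInfo.InvariantCulture.CompareInfo
            .Compare(a, b, CompareOptions.IgnoreNonSpace) == 0;
    }
}
```

This could be useful for checking whether an uploaded name collides with an existing one once accents are stripped.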

score 0 · Answer 2 · answered Aug 14 '13 at 19:26

A solution using regular expressions:

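// Requires: using System.Text.RegularExpressions;
string ReplaceSpecial(string input, string replace, char replacewith)
{
    char[] back = input.ToCharArray();
    // Regex.Matches takes the input first, then the pattern.
    // `replace` goes into a character class as-is, so the caller must
    // already have escaped characters such as ']', '\' or '-'.
    var matches = Regex.Matches(input, String.Format("[{0}]", replace));
    foreach (Match m in matches)
        back[m.Index] = replacewith;
    return new string(back);
}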

A somewhat simpler solution using String.Replace:

string ReplaceSpecial(string input, char[] replace, char replacewith)
{
    string back = input;
    foreach (char i in replace)
        // Strings are immutable: Replace returns a new string, so keep the result.
        back = back.Replace(i, replacewith);
    return back;
}

score 0 · Answer 3 · answered Aug 14 '13 at 19:40

static string RemoverAcentuacao(string s)
{            
        string caracterComAcento = "áàãâäéèêëíìîïóòõôöúùûüçáàãâÄéèêëíìîïóòõÖôúùûÜç, ?&:/!;ºª%‘’()\"”“";
        string caracterSemAcento = "aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC___________________";
        return new String(s.Select(c =>
        {
            int i = caracterComAcento.IndexOf(c);
            return (i == -1) ? c : caracterSemAcento[i];
        }).ToArray());
}

score -1 · Answer 4 · answered Aug 14 '13 at 19:10

Here is a really simple method that I've used recently.

I hope it meets your requirements. To be honest, the code is a bit difficult to read due to the language of the variable declarations.

    // Static so the static method below can use it.
    static List<char> InvalidCharacters = new List<char>() { 'a','b','c' };

    static string StripInvalidCharactersFromField(string field)
    {
        for (int i = 0; i < field.Length; i++)
        {
            // List<char>.Contains takes a char, so compare the character directly.
            if (InvalidCharacters.Contains(field[i]))
            {
                field = field.Remove(i, 1);
                i--; // stay on the same index after the removal
            }
        }

        return field;
    }

Michael

He doesn't want to remove them, he wants to replace them (in spite of the title). – hatchet - done with SOverflow Aug 14 '13 at 19:14
This is an ungodly inefficient algorithm as well... – Servy Aug 14 '13 at 19:24
@Servy wonderful contribution, as usual. You're a real delight around here. – Michael Aug 14 '13 at 19:32
@Michael Happy to help. I wouldn't want someone to be unaware that they had provided a horribly, horribly inefficient solution to a question specifically asking to improve the performance of an existing solution. – Servy Aug 14 '13 at 19:34
@Servy I fear you've missed the point entirely. He asked for a 'less complex' algorithm, and the word performance appeared exactly 0 times in his post. I'd wager this could be considered far less complex than his original depending on criteria for 'complex'. Additionally, give at least a shade of an explanation for the inefficiencies in my post. You haven't made me, or anyone, aware of a single thing. The spirit of SO is expansion of knowledge, even if delivered harshly. Not simply being... unpleasant. Where's your answer? – Michael Aug 14 '13 at 19:39
@Michael The code in the OP *does more*. This code sample doesn't perform the same level of functionality, so comparing the amount or complexity of the code isn't really relevant; they don't do the same thing. If you're curious as to what I was referring to, you could have simply asked. There are a number of fundamental problems. `Remove` is going to be doing a copy of all but one of the characters, for each one you remove. That will result in a *lot* of redundant copying of data. `List` also can't be efficiently searched, you should be using a HashSet. That should get you started. – Servy Aug 14 '13 at 19:44
@Servy Thank you. I've been inspired to learn more about strings and their immutability (which i believe to be the cause of the excessive data copy). – Michael Aug 14 '13 at 20:18
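
The two points Servy raises in the thread above (each `Remove` call copies the rest of the string, and `List<char>.Contains` is a linear scan) can be addressed roughly as follows. This is a sketch rather than the answerer's code: the class and method names are hypothetical, and the `'a','b','c'` set is just the placeholder from the answer.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FieldCleaner // hypothetical name
{
    // HashSet gives constant-time membership checks on average.
    private static readonly HashSet<char> InvalidCharacters =
        new HashSet<char> { 'a', 'b', 'c' };

    public static string StripInvalidCharacters(string field)
    {
        // Single pass: keep the valid characters, then build the result once.
        return new string(field.Where(c => !InvalidCharacters.Contains(c)).ToArray());
    }
}
```

With the placeholder set, `FieldCleaner.StripInvalidCharacters("abcdef cab")` returns `"def "`.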
