Most efficient way to remove special characters from string

Question

I want to remove all special characters from a string. Allowed characters are A-Z (uppercase or lowercase), numbers (0-9), underscore (_), or the dot sign (.).

I have the following, it works but I suspect (I know!) it's not very efficient:

    public static string RemoveSpecialCharacters(string str)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.Length; i++)
        {
            if ((str[i] >= '0' && str[i] <= '9')
                || (str[i] >= 'A' && str[i] <= 'z'
                    || (str[i] == '.' || str[i] == '_')))
                {
                    sb.Append(str[i]);
                }
        }

        return sb.ToString();
    }

What is the most efficient way to do this? What would a regular expression look like, and how does it compare with normal string manipulation?

The strings that will be cleaned will be rather short, usually between 10 and 30 characters in length.

I won't put this in an answer since it won't be any more efficient, but there are a number of static char methods like char.IsLetterOrDigit() that you could use in your if statement to make it more legible at least. — Martin Harris, Jul 13 '09 at 15:40
I'm not sure that checking for A to z is safe, in that it brings in 6 characters that aren't alphabetical, only one of which is desired (underbar). — Steven Sudit, Jul 13 '09 at 15:41
Focus on making your code more readable. Unless you are doing this in a loop like 500 times a second, the efficiency isn't a big deal. Use a regexp and it will be much easier to read. — Byron Whitlock, Jul 13 '09 at 15:42
Martin, in my experience, the list of characters to filter tends to shift over time, and doesn't necessarily correspond perfectly to any of the char.IsSomething() methods. That's one of the reasons I've leaned towards a table-driven approach. — Steven Sudit, Jul 13 '09 at 15:42
Byron, you're probably right about needing to emphasize readability. However, I'm skeptical about regexp being readable. :-) — Steven Sudit, Jul 13 '09 at 15:45
Regular expressions being readable or not is kind of like German being readable or not; it depends on if you know it or not (although in both cases you will every now and then come across grammatical rules that make no sense ;) — Blixt, Jul 13 '09 at 15:50
Point taken. Regexp are not a bad thing and there are certainly many places where they fit admirably. — Steven Sudit, Jul 13 '09 at 16:44
Using @Luke's answer to ditch the StringBuilder for a char[] will provide the largest absolute speedup over any of the other techniques shown. Not what I expected. — user7116, Jul 13 '09 at 17:27

score 388 · Accepted Answer · edited Jun 25 '17 at 08:46

Why do you think that your method is not efficient? It's actually one of the most efficient ways that you can do it.

You should of course read the character into a local variable or use an enumerator to reduce the number of array accesses:

public static string RemoveSpecialCharacters(this string str) {
   StringBuilder sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

One thing that makes a method like this efficient is that it scales well. The execution time is proportional to the length of the string, so there are no nasty surprises if you use it on a large string.

Edit:
I made a quick performance test, running each function a million times with a 24 character string. These are the results:

Original function: 54.5 ms.
My suggested change: 47.1 ms.
Mine with setting StringBuilder capacity: 43.3 ms.
Regular expression: 294.4 ms.
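
A timing loop of the kind described above could be sketched roughly as follows. The `Measure` helper, the `Sample` string and the class name are assumptions made for illustration; this is not the harness that produced the numbers above, and results will differ between machines and runtimes.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class CleanerBenchmark
{
    // Made-up 24-character test string (an assumption, not the one used above).
    public const string Sample = "ab#c_1.2$3-XYZ!def(456)?";

    // The StringBuilder variant from above, repeated so the sketch is self-contained.
    public static string RemoveSpecialCharacters(string str)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in str)
        {
            if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Hypothetical helper: calls the function `iterations` times and returns the elapsed milliseconds.
    public static double Measure(Func<string, string> cleaner, string input, int iterations)
    {
        cleaner(input); // warm-up call so that JIT compilation is not part of the timing
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            cleaner(input);
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds;
    }
}
```

Calling `CleanerBenchmark.Measure(CleanerBenchmark.RemoveSpecialCharacters, CleanerBenchmark.Sample, 1000000)` from a release build gives one figure to compare against the other variants.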

Edit 2: I added the distinction between A-Z and a-z in the code above. (I reran the performance test, and there is no noticeable difference.)

Edit 3:
I tested the lookup+char[] solution, and it runs in about 13 ms.

The price to pay is, of course, the initialization of the huge lookup table and keeping it in memory. Well, it's not that much data, but it's a lot for such a trivial function...

private static bool[] _lookup;

static Program() {
   _lookup = new bool[65536];
   for (char c = '0'; c <= '9'; c++) _lookup[c] = true;
   for (char c = 'A'; c <= 'Z'; c++) _lookup[c] = true;
   for (char c = 'a'; c <= 'z'; c++) _lookup[c] = true;
   _lookup['.'] = true;
   _lookup['_'] = true;
}

public static string RemoveSpecialCharacters(string str) {
   char[] buffer = new char[str.Length];
   int index = 0;
   foreach (char c in str) {
      if (_lookup[c]) {
         buffer[index] = c;
         index++;
      }
   }
   return new string(buffer, 0, index);
}

answered Jul 13 '09 at 15:45 by Guffa · edited Jun 25 '17 at 08:46 by nologo

I agree. The only other change I would make is to add the initial capacity argument to the StringBuilder constructor, "= new StringBuilder(str.Length)". – David Jul 13 '09 at 16:05
+1, and don't forget to change it to: (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') for correctness. – user7116 Jul 13 '09 at 16:19
@David: Yes setting the capacity would give a slight performance improvement. – Guffa Jul 13 '09 at 16:22
My answer, using a `char[]` buffer rather than `StringBuilder`, has a slight edge on this one according to my testing. (Mine's less readable though, so the small performance benefit probably isn't worth it.) – LukeH Jul 13 '09 at 16:46
Luke, while it may well be faster to append to a char buffer than a StringBuilder, we still need to copy it into a string when we're done. The StringBuilder uses a string internally, so there's no additional copy. – Steven Sudit Jul 13 '09 at 16:52
@Steven: That may well be the case, but the benchmarks speak for themselves! In my tests, using a `char[]` buffer performs (slightly) better than `StringBuilder`, even when scaling up to strings that are tens of thousands of characters in length. – LukeH Jul 13 '09 at 17:02
@Guffa: yours 0.0416ms/string, int[]/bool[] lookup 0.0399ms/string. Yours is also 10x readable. – user7116 Jul 13 '09 at 17:10
@Luke: yours is 0.0294ms/string using @Guffa's with char[] v. 0.0416ms/string with StringBuilder. Quite the improvement good sir, couple it with a lookup table and you're at 0.0286ms/string. – user7116 Jul 13 '09 at 17:17
@Luke: Ok, then I'll have to benchmark char[]->StringBuilder against straight StringBuilder the next time I have a particular data set to optimize for. – Steven Sudit Jul 13 '09 at 18:21
Just so you know, don't copy-paste this code (as I did), the '9' char in the loop is actually the number 9, so char numbers will not actually be put into the allowed array. – cthulhu Feb 11 '11 at 12:27
@Cthulhu: Thanks, I corrected that. Still, as always, the code is not guaranteed to be bug free. :) – Guffa Feb 11 '11 at 13:02
@downvoter: Why the downvote? If you don't explain what you think is wrong, it can't improve the answer. – Guffa Aug 06 '11 at 18:56
Other than specifications in the question, is there a reason for omitting the "white space" character ' '? For functional purposes I may consider adding the (c == ' ') and starting off the method with str.Trim() to remove leading and trailing white spaces. – Zack Jannsen Aug 10 '12 at 14:26
@ZackJannsen: White space characters are all the "invisible" characters, such as spaces, tabs and line breaks. The `' '` character is not called white space, it's the space character. If you want a function that removes spaces also, you can do that, but that's not what the question was. – Guffa Aug 10 '12 at 14:53
(Edit 3) I would make this array much smaller. I think you can probably represent all needed characters in 2^7 (128) entries, like the ASCII character set. Even though you are filling this array programmatically, all characters are known at compile time, so you could even make it a readonly array. The danger is that others might copy/paste this code into a subroutine and allocate much more space than needed, more often than required. – Sprague Dec 13 '13 at 15:04
@Guffa In order for the array to cover all char values, you should instantiate it with a length of 65536 so it goes from 0 to 65535. By the way, I use your solution for my code and it is great. – Luke Marlin Jan 08 '15 at 10:06
@LukeMarlin: Naturally it should be 65536, thanks for spotting that error. – Guffa Jan 08 '15 at 10:12
@Guffa Do your performance measurements include the bool table creation? – SILENT May 05 '15 at 01:40
@SILENT: No, it doesn't, but you should only do that once. If you allocate an array that large each time you call the method (and if you call the method frequently) then the method becomes the slowest by far, and causes a lot of work for the garbage collector. – Guffa May 05 '15 at 07:36
Instead of using A to Z you could also use char.IsLetter(c) for unicode support. – XzaR Apr 23 '16 at 09:48
Just for sake of theory, what if **no extra space** is to be used to remove special elements from a list. Note that there is an **array instead of a string** so mutation is possible. Is the algorithm simply traversing over the array and exchanging elements that are special to the end? That seems natural but I came here because of that question and wanted to confirm. – user3245268 Dec 11 '19 at 05:50
@Guffa Why not use a dictionary/hashset instead of lookup table? – manu4rhyme Feb 18 '20 at 07:16
@manu4rhyme It's slower, it takes almost twice as long as my suggested function. Looking up a value in an array is straight forward. Looking up a value in a hash set first has to create a hash code from the value, then calculate a bucket index from the hash code, then loop through the items in the bucket (if any) and compare each to the value. – Guffa Feb 23 '20 at 22:23
@Guffa - Regex could be a good idea for this, using `Regex.Replace(yourString, @"[^0-9a-zA-Z]+", "");` - That seems to do the job fairly quick. – Momoro Mar 28 '20 at 06:04
@Momoro I tested that already, and as you can see from the results above it's an order of magnitude slower. It's still a reasonable solution for when the performance is not of great concern. – Guffa Apr 17 '20 at 08:36
@Guffa I agree. I actually prefer performance, and now that you pointed it out, it is a tad (**Maybe a whole lot**) slower :D – Momoro Apr 18 '20 at 03:25
I know this is an older post, however, the fact that people are taking the time to update multiple with iterations indicates a true desire to want to improve the answer and deserves a kudos in my opinion. A feather in the cap for the community, well done everyone! – Erick Brown Jun 28 '20 at 13:28
I bet this is a better solution than going for Regex, because this is hand-optimized code, and Regex methods are built-in functions that also run loops and have no magic whatsoever! – Venugopal M Dec 19 '22 at 12:11

Blixt · Answer 2 · 2009-07-13T16:37:00.993

237

Well, unless you really need to squeeze the performance out of your function, just go with what is easiest to maintain and understand. For additional performance, you can either pre-compile the regular expression or just tell it to compile on first call (subsequent calls will be faster). A regular expression would look like this:

public static string RemoveSpecialCharacters(string str)
{
    return Regex.Replace(str, "[^a-zA-Z0-9_.]+", "", RegexOptions.Compiled);
}
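
The "pre-compile it" option mentioned above could look something like the sketch below, where the `Regex` object is built once and kept in a static field. The class and field names are made up for illustration, and whether this is measurably faster than the static `Regex.Replace` call depends on the runtime, so it is worth benchmarking before relying on it.

```csharp
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // Built once and reused on every call; RegexOptions.Compiled trades
    // a higher start-up cost for faster matching afterwards.
    private static readonly Regex NotAllowed =
        new Regex("[^a-zA-Z0-9_.]+", RegexOptions.Compiled);

    public static string RemoveSpecialCharacters(string str)
    {
        // Every run of disallowed characters is replaced with nothing.
        return NotAllowed.Replace(str, "");
    }
}
```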

answered Jul 13 '09 at 15:40 by Blixt · edited Jul 13 '09 at 16:37

I'd guess that this is probably a complex enough query that it would be faster than the OP's approach, especially if pre-compiled. I have no evidence to back that up, however. It should be tested. Unless it's drastically slower, I'd choose this approach regardless, since it's way easier to read and maintain. +1 – rmeador Jul 13 '09 at 15:48
It's a very simple regex (no backtracking or any complex stuff in there) so it should be pretty damn fast. – Jul 13 '09 at 16:00
Perhaps. But do you want to bet the table-driven approach will be faster? – Steven Sudit Jul 13 '09 at 16:11
@rmeador: without it being compiled it is about 5x slower, compiled it is 3x slower than his method. Still 10x simpler though :-D – user7116 Jul 13 '09 at 16:15
I'd definitely recommend Steven's method if performance is critical. Regular expressions make text validation/modification simple, not efficient, as many are inclined to think. – Blixt Jul 13 '09 at 16:21
Regular expressions are no magical hammers and never faster than hand optimized code. – Christian Klauser Jul 13 '09 at 16:58
For those who remember Knuth's famous quote about optimization, this is where to start. Then, if you find that you need the extra thousandth of a millisecond performance, go with one of the other techniques. – John Feb 25 '14 at 19:02
always believed regex were faster. Thanks @ChristianKlauser +1 :) – Vbp Mar 17 '14 at 01:22
@Blixt - Can you also help to remove dot character and numbers, plz? – Zameer Ansari Mar 19 '15 at 16:15
@nerdspal just remove the dot from the regex expression it would be "[^a-zA-Z0-9_]+" – nramirez Apr 21 '15 at 16:11
Might save a few ticks and GC by not using the static Replace method, and initialize the Regex as a static instance? – bigfoot Apr 27 '18 at 13:46
Yep, that's what I referred to with "pre-compile it", but using the `Compiled` option should store the regex in a global cache already, so that it doesn't need to be reinitialized nor garbage collected. – Blixt Apr 27 '18 at 13:52
Ahh. I didn't know that. Didn't decompile the code deep enough :) – bigfoot Apr 27 '18 at 14:02
Tried this today in C# (.NET 4.6.2), but any `^` chars are kept instead of being removed. – Kjara Feb 07 '19 at 15:30

score 19 · Answer 3 · answered Jul 13 '09 at 15:42

A regular expression will look like:

public string RemoveSpecialChars(string input)
{
    return Regex.Replace(input, @"[^0-9a-zA-Z\._]", string.Empty);
}

But if performance is highly important, I recommend running some benchmarks before taking the "regex path"...

Steven Sudit · Answer 4 · 2009-07-13T15:44:20.633

15

I suggest creating a simple lookup table, which you can initialize in the static constructor to set any combination of characters to valid. This lets you do a quick, single check.

edit

Also, for speed, you'll want to initialize the capacity of your StringBuilder to the length of your input string. This will avoid reallocations. These two methods together will give you both speed and flexibility.

another edit

I think the compiler might optimize it out, but as a matter of style as well as efficiency, I recommend foreach instead of for.
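
Put together, the three suggestions above (a lookup table filled in a static constructor, a pre-sized StringBuilder, and foreach) might look like the sketch below. The `Allowed` string and the class name are hypothetical; the idea is only that changing the set of valid characters means editing one string.

```csharp
using System.Text;

public static class TableCleaner
{
    // Hypothetical configuration string: every character listed here is kept.
    private const string Allowed =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz._";

    // One flag per possible char value (65536 entries), filled once below.
    private static readonly bool[] IsAllowed = new bool[char.MaxValue + 1];

    static TableCleaner()
    {
        foreach (char c in Allowed)
        {
            IsAllowed[c] = true;
        }
    }

    public static string RemoveSpecialCharacters(string str)
    {
        // The capacity equals the input length, so the builder never has to grow.
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (IsAllowed[c]) // a single array lookup per character
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```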

answered Jul 13 '09 at 15:39 by Steven Sudit · edited Jul 13 '09 at 15:44

For arrays, `for` and `foreach` produce similar code. I don't know about strings though. I doubt that the JIT knows about the array-like nature of String. – Christian Klauser Jul 13 '09 at 15:52
I bet the JIT knows more about the array-like nature of string than your [joke removed]. Anders etal did a lot of work optimizing everything about strings in .net – Jul 13 '09 at 16:02
I've done this using HashSet and it is about 2x slower than his method. Using bool[] is barely faster (0.0469ms/iter v. 0.0559ms/iter) than the version he has in OP...with the problem of being less readable. – user7116 Jul 13 '09 at 16:36
I used an int[] with 0 or 1, since alignment affects speed. Wasn't able to find anything faster. – Steven Sudit Jul 13 '09 at 16:40
I couldn't see any performance difference between using a bool array and an int array. I would use a bool array, as it brings down the lookup table from 256 kb to 64 kb, but it's still a lot of data for such a trivial function... And it's only about 30% faster. – Guffa Jul 13 '09 at 17:07
Guffa, I'm going to answer in parts. 1) I looked up my notes and it seems that I was incorrect. Boolean was about the same speed as Integer, so that's what I used. – Steven Sudit Jul 13 '09 at 18:16
@Guffa 2) Given that we're only keeping alphanumerics and a few Basic Latin characters, we only need a table for the low byte, so size isn't really an issue. If we wanted to be general-purpose, then the standard Unicode technique is double-indirection. In other words, a table of 256 table references, many of which point to the same empty table. – Steven Sudit Jul 13 '09 at 18:17
@Guffa 3) The speed boost depends very much on how complex the criteria are that we need to check for each character. Compared to, say, two checks for 0-9, the table approach isn't a huge win. But its speed remains constant even if the pattern is effectively random and would take hundreds of checks. – Steven Sudit Jul 13 '09 at 18:19

score 15 · Answer 5 · answered Jul 13 '09 at 16:06

public static string RemoveSpecialCharacters(string str)
{
    char[] buffer = new char[str.Length];
    int idx = 0;

    foreach (char c in str)
    {
        if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z') || (c == '.') || (c == '_'))
        {
            buffer[idx] = c;
            idx++;
        }
    }

    return new string(buffer, 0, idx);
}

answered Jul 13 '09 at 16:06 by LukeH

+1, tested and it is about 40% faster than StringBuilder. 0.0294ms/string v. 0.0399ms/string – user7116 Jul 13 '09 at 17:14
Just to be sure, do you mean StringBuilder with or without pre-allocation? – Steven Sudit Jul 13 '09 at 19:28
With pre-allocation, it is still 40% slower than the char[] allocation and new string. – user7116 Jul 14 '09 at 02:28
I like this. I tweaked this method `foreach (char c in input.Where(c => char.IsLetterOrDigit(c) || allowedSpecialCharacters.Any(x => x == c))) buffer[idx++] = c;` – Chris Marisic Oct 17 '12 at 15:47

score 14 · Answer 6 · answered Jul 04 '12 at 21:31

If you're using a dynamic list of characters, LINQ may offer a much faster and more graceful solution:

public static string RemoveSpecialCharacters(string value, char[] specialCharacters)
{
    return new String(value.Except(specialCharacters).ToArray());
}

I compared this approach against two of the previous "fast" approaches (release compilation):

Char array solution by LukeH - 427 ms
StringBuilder solution - 429 ms
LINQ (this answer) - 98 ms

Note that the algorithm is slightly modified: the characters are passed in as an array rather than hard-coded, which could be impacting things slightly (i.e. the other solutions would have an inner for loop to check the character array).

If I switch to a hard-coded solution using a LINQ where clause, the results are:

Char array solution - 7ms
StringBuilder solution - 22ms
LINQ - 60 ms

Might be worth looking at LINQ or a modified approach if you're planning on writing a more generic solution, rather than hard-coding the list of characters. LINQ definitely gives you concise, highly readable code - even more so than Regex.

This approach looks nice, but it doesn't work - Except() is a set operation, so you will end up with only the first appearance of each unique character in the string. — McKenzieG1, Mar 02 '17 at 17:50
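
As the comment above points out, `Except()` returns a set difference, so each remaining character appears only once in the result. A version that keeps repeated characters could filter with `Where` instead, as in the sketch below (the class name is made up, and the timings quoted in the answer were not measured with this variant, so they may not carry over).

```csharp
using System;
using System.Linq;

public static class LinqCleaner
{
    public static string RemoveSpecialCharacters(string value, char[] specialCharacters)
    {
        // Where tests each element on its own, so duplicates in the input survive.
        // Array.IndexOf returns -1 when the character is not in the removal list.
        return new string(value.Where(c => Array.IndexOf(specialCharacters, c) < 0).ToArray());
    }
}
```

For the input `"hello, world!!"` with `','` and `'!'` as the characters to remove, this returns `"hello world"`, whereas the `Except()` version returns `"helo wrd"`.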

lc. · Answer 7 · 2009-07-13T15:59:25.673

5

I'm not convinced your algorithm is anything but efficient. It's O(n) and only looks at each character once. You're not gonna get any better than that unless you magically know values before checking them.

I would however initialize the capacity of your StringBuilder to the initial size of the string. I'm guessing your perceived performance problem comes from memory reallocation.

Side note: Checking A-z is not safe. You're including [, \, ], ^, _, and `...
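
To see which characters an `'A'` to `'z'` range check lets through, a small helper like the one below (names are made up for illustration) collects every non-letter in that range:

```csharp
using System.Text;

public static class RangeCheck
{
    // Collects the characters between 'A' and 'z' that are not ASCII letters.
    public static string NonLettersBetweenAAndZ()
    {
        StringBuilder sb = new StringBuilder();
        for (char c = 'A'; c <= 'z'; c++)
        {
            bool isLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (!isLetter)
            {
                sb.Append(c);
            }
        }
        return sb.ToString(); // the six characters with codes 91 to 96: [\]^_`
    }
}
```

Of those six, only the underscore is on the allowed list in the question.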

Side note 2: For that extra bit of efficiency, put the comparisons in an order to minimize the number of comparisons. (At worst, you're talking 8 comparisons tho, so don't think too hard.) This changes with your expected input, but one example could be:

if (str[i] >= '0' && str[i] <= 'z' && 
    (str[i] >= 'a' || str[i] <= '9' ||  (str[i] >= 'A' && str[i] <= 'Z') || 
    str[i] == '_') || str[i] == '.')

Side note 3: If for whatever reason you REALLY need this to be fast, a switch statement may be faster. The compiler should create a jump table for you, resulting in only a single comparison:

switch (str[i])
{
    case '0':
    case '1':
    .
    .
    .
    case '.':
        sb.Append(str[i]);
        break;
}
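
Spelled out in full, the switch might look like the sketch below. The case list is an assumption based on the allowed characters in the question (digits, ASCII letters, dot and underscore), wrapped in a hypothetical `IsAllowed` helper. Whether the compiler actually emits a jump table is up to the compiler, so any speed benefit should be measured rather than assumed.

```csharp
public static class SwitchCleaner
{
    // Returns true for the characters the question wants to keep.
    public static bool IsAllowed(char c)
    {
        switch (c)
        {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
            case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
            case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N':
            case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U':
            case 'V': case 'W': case 'X': case 'Y': case 'Z':
            case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
            case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
            case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
            case 'v': case 'w': case 'x': case 'y': case 'z':
            case '.': case '_':
                return true;  // all listed cases share this one branch
            default:
                return false; // everything else is a "special" character
        }
    }
}
```

The loop body then becomes `if (SwitchCleaner.IsAllowed(str[i])) sb.Append(str[i]);`.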

answered Jul 13 '09 at 15:43 by lc. · edited Jul 13 '09 at 15:59

I agree that you can't beat O(n) on this one. However, there is a cost per comparison which can be lowered. A table lookup has a low, fixed cost, while a series of comparisons is going to increase in cost as you add more exceptions. – Steven Sudit Jul 13 '09 at 15:47
About side note 3, do you really think the jump table would be faster than table lookup? – Steven Sudit Jul 13 '09 at 16:12
I ran the quick performance test on the switch solution, and it performs the same as the comparison. – Guffa Jul 13 '09 at 16:54
@Steven Sudit - I'd venture they're actually about the same. Care to run a test? – lc. Jul 13 '09 at 17:12
O(n) notation sometimes pisses me off. People will make stupid assumptions based on the fact the algorithm is already O(n). If we changed this routine to replace the str[i] calls with a function that retrieved the comparison value by constructing a one-time SSL connection with a server on the opposite side of the world... you damn sure would see a massive performance difference and the algorithm is STILL O(n). The cost of O(1) for each algorithm is significant and NOT equivalent! – darron Jul 13 '09 at 17:47
@Ic: I'm not sure that straight tables would be much faster, but I doubt they'd be slower. Both approaches involve a lookup; the jump table one also involves an additional branch, and that's typically slow. I would also avoid the jump table solution in the first place because of the inflexibility. – Steven Sudit Jul 13 '09 at 18:24

score 5 · Answer 8 · answered Dec 10 '17 at 15:48

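You can use a regular expression as follows. Note that this pattern also keeps `@` and `-`, and that `\w` matches Unicode letters and digits as well as the underscore, so it is broader than the character set in the question: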

return Regex.Replace(strIn, @"[^\w\.@-]", "", RegexOptions.None, TimeSpan.FromSeconds(1.0));

answered Dec 10 '17 at 15:48 by Giovanny Farto M.

score 4 · Answer 9 · answered Jul 13 '09 at 15:42

It seems good to me. The only improvement I would make is to initialize the StringBuilder with the length of the string.

StringBuilder sb = new StringBuilder(str.Length);

answered Jul 13 '09 at 15:42 by bruno conde

score 4 · Answer 10 · edited Apr 28 '10 at 03:16

StringBuilder sb = new StringBuilder();

for (int i = 0; i < fName.Length; i++)
{
    if (char.IsLetterOrDigit(fName[i]))
    {
        sb.Append(fName[i]);
    }
}

answered Mar 27 '10 at 19:32 by Chamika Sandamal · edited Apr 28 '10 at 03:16 by sth

James Westgate · Answer 11 · 2022-01-02T12:24:27.057

Here is another approach that attempts to improve performance by reducing allocations, which matters most if this function is called many times.

It works because the result is guaranteed to be no longer than the input, so a single buffer of the input's length is enough and the result can be returned as a slice of it without creating extra copies in memory. For the same reason you can't use stackalloc to create the buffer: stack memory can't be returned from the method, so the result would first have to be copied out of the buffer.

public static string RemoveSpecialCharacters(this string str)
{
    return RemoveSpecialCharacters(str.AsSpan()).ToString();
}

public static ReadOnlySpan<char> RemoveSpecialCharacters(this ReadOnlySpan<char> str)
{
    Span<char> buffer = new char[str.Length];
    int idx = 0;

    foreach (char c in str)
    {
        if (char.IsLetterOrDigit(c))
        {
            buffer[idx] = c;
            idx++;
        }
    }

    return buffer.Slice(0, idx);
}

score 3 · Answer 12 · answered Jan 03 '18 at 18:34

There are lots of proposed solutions here, some more efficient than others, but perhaps not very readable. Here's one that may not be the most efficient, but certainly usable for most situations, and is quite concise and readable, leveraging Linq:

string stringToclean = "This is a test.  Do not try this at home; you might get hurt. Don't believe it?";

var validPunctuation = new HashSet<char>(". -");

var cleanedVersion = new String(stringToclean.Where(x => (x >= 'A' && x <= 'Z') || (x >= 'a' && x <= 'z') || validPunctuation.Contains(x)).ToArray());

var cleanedLowercaseVersion = new String(stringToclean.ToLower().Where(x => (x >= 'a' && x <= 'z') || validPunctuation.Contains(x)).ToArray());

score 3 · Answer 13 · answered Dec 13 '11 at 18:59

I agree with this code sample. The only difference is that I made it into an extension method of the string type, so that you can use it in a single, very simple line of code:

string test = "abc@#$123";
test = test.RemoveSpecialCharacters(); // strings are immutable, so keep the returned value

Thanks to Guffa for the experiment.

public static class MethodExtensionHelper
{
    public static string RemoveSpecialCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in str)
        {
            if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}

score 2 · Answer 14 · answered Jul 13 '09 at 15:38

I would use a String Replace with a Regular Expression searching for "special characters", replacing all characters found with an empty string.

answered Jul 13 '09 at 15:38 by Stephen Wrighton

+1 certainly less code and arguably more readable ignoring write-once Regex. – kenny Jul 13 '09 at 16:38
@kenny - I agree. The original question even states that the strings are short - 10-30 chars. But apparently a lot of people still think we're selling CPU time by the second... – Tom Bushell Nov 12 '11 at 00:10
Regular expressions are slow, so they shouldn't always be used. – RockOnGom Jul 10 '13 at 19:45

Daniel Blankensteiner · Answer 15 · 2013-11-26T15:06:36.333

I had to do something similar for work, but in my case I had to filter out everything that is not a letter, number or whitespace (but you could easily modify it to your needs). The filtering is done client-side in JavaScript, but for security reasons I am also doing the filtering server-side. Since I can expect most of the strings to be clean, I would like to avoid copying the string unless I really need to. This led me to the implementation below, which should perform better for both clean and dirty strings.

public static string EnsureOnlyLetterDigitOrWhiteSpace(string input)
{
    StringBuilder cleanedInput = null;
    for (var i = 0; i < input.Length; ++i)
    {
        var currentChar = input[i];
        var charIsValid = char.IsLetterOrDigit(currentChar) || char.IsWhiteSpace(currentChar);

        if (charIsValid)
        {
            if(cleanedInput != null)
                cleanedInput.Append(currentChar);
        }
        else
        {
            if (cleanedInput != null) continue;
            cleanedInput = new StringBuilder();
            if (i > 0)
                cleanedInput.Append(input.Substring(0, i));
        }
    }

    return cleanedInput == null ? input : cleanedInput.ToString();
}

score 1 · Answer 16 · edited Nov 26 '13 at 15:19

The following code produces the output below. The conclusion is that we can also save some memory by allocating a smaller array:

lookup = new bool[123];

for (var c = '0'; c <= '9'; c++)
{
    lookup[c] = true; System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}

for (var c = 'A'; c <= 'Z'; c++)
{
    lookup[c] = true; System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}

for (var c = 'a'; c <= 'z'; c++)
{
    lookup[c] = true; System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}

48: 0  
49: 1  
50: 2  
51: 3  
52: 4  
53: 5  
54: 6  
55: 7  
56: 8  
57: 9  
65: A  
66: B  
67: C  
68: D  
69: E  
70: F  
71: G  
72: H  
73: I  
74: J  
75: K  
76: L  
77: M  
78: N  
79: O  
80: P  
81: Q  
82: R  
83: S  
84: T  
85: U  
86: V  
87: W  
88: X  
89: Y  
90: Z  
97: a  
98: b  
99: c  
100: d  
101: e  
102: f  
103: g  
104: h  
105: i  
106: j  
107: k  
108: l  
109: m  
110: n  
111: o  
112: p  
113: q  
114: r  
115: s  
116: t  
117: u  
118: v  
119: w  
120: x  
121: y  
122: z

You can also add the following code lines to support Russian locale (array size will be 1104):

for (var c = 'А'; c <= 'Я'; c++)
{
    lookup[c] = true; System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}

for (var c = 'а'; c <= 'я'; c++)
{
    lookup[c] = true; System.Diagnostics.Debug.WriteLine((int)c + ": " + (char)c);
}

Christian Klauser · Answer 17 · 2013-09-25T17:38:39.830

1

I wonder if a Regex-based replacement (possibly compiled) is faster. Edit: someone has tested this and found it to be about 5 times slower.

Other than that, you should initialize the StringBuilder with an expected length, so that the intermediate string doesn't have to be copied around while it grows.

A good number is the length of the original string, or something slightly lower (depending on the nature of the function's inputs).

Finally, you can use a lookup table (in the range 0..127) to find out whether a character is to be accepted.
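
A table limited to the range 0..127 could be sketched as follows. Because a `char` is a 16-bit value, the character has to be range-checked before it is used as an index. The class and helper names are assumptions for illustration.

```csharp
public static class SmallTableCleaner
{
    // 128 entries cover the ASCII range, which contains every allowed character.
    private static readonly bool[] IsAllowed = BuildTable();

    private static bool[] BuildTable()
    {
        bool[] table = new bool[128];
        for (char c = '0'; c <= '9'; c++) table[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) table[c] = true;
        for (char c = 'a'; c <= 'z'; c++) table[c] = true;
        table['.'] = true;
        table['_'] = true;
        return table;
    }

    public static string RemoveSpecialCharacters(string str)
    {
        char[] buffer = new char[str.Length];
        int index = 0;
        foreach (char c in str)
        {
            // The range check comes first: c can be as large as 65535,
            // which would be out of bounds for the 128-entry table.
            if (c < IsAllowed.Length && IsAllowed[c])
            {
                buffer[index++] = c;
            }
        }
        return new string(buffer, 0, index);
    }
}
```

The extra comparison costs a little per character, in exchange for a table of 128 entries instead of 65536.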

answered Jul 13 '09 at 15:50 by Christian Klauser · edited Sep 25 '13 at 17:38

A regular expression has been tested already, and it's about five times slower. With a lookup table in the range 0..127 you still have to range check the character code before using the lookup table, as characters are 16 bit values, not 7 bit values. – Guffa Sep 24 '13 at 21:56
@Guffa Err... yes? ;) – Christian Klauser Sep 25 '13 at 17:39

score 1 · Answer 18 · answered Jul 13 '09 at 16:16

For S&G's, Linq-ified way:

var original = "(*^%foo)(@)&^@#><>?:\":';=-+_";
var valid = new char[] { 
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 
    'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 
    'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 
    'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '1', '2', '3', '4', '5', '6', '7', '8', 
    '9', '0', '.', '_' };
var result = string.Join("",
    (from x in original.ToCharArray() 
     where valid.Contains(x) select x.ToString())
        .ToArray());

I don't think this is going to be the most efficient way, however.

It's not, because it's a linear search. – Steven Sudit Jul 13 '09 at 16:41

score 1 · Answer 19 · edited Sep 24 '12 at 00:18

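In C++, use the erase-remove idiom:

#include <algorithm>
#include <cctype>
#include <string>

// Returns true for characters that should be removed.
bool my_predicate(char c)
{
    // Cast to unsigned char first: passing a negative char to isalpha is undefined behavior.
    return !(std::isalpha(static_cast<unsigned char>(c)) || c == '_' || c == ' '); // depending on your definition of special characters
}

s.erase(std::remove_if(s.begin(), s.end(), my_predicate), s.end());

And you'll get a clean string s.

erase() strips out all the special characters, and the my_predicate() function makes the filter highly customisable.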

answered Sep 23 '12 at 08:02 by Bhavya Agarwal · edited Sep 24 '12 at 00:18 by Austin Henley

paparazzo · Answer 20 · 2013-09-24T20:24:10.660

1

HashSet is O(1)
Not sure if it is faster than the existing comparison

private static HashSet<char> ValidChars = new HashSet<char>() { 'a', 'b', 'c', 'A', 'B', 'C', '1', '2', '3', '_' };
public static string RemoveSpecialCharacters(string str)
{
    StringBuilder sb = new StringBuilder(str.Length / 2);
    foreach (char c in str)
    {
        if (ValidChars.Contains(c)) sb.Append(c);
    }
    return sb.ToString();
}

I tested this, and it is not faster than the accepted answer.
I will leave it up, since it would be a good solution if you needed a configurable set of characters.

answered Sep 24 '13 at 19:29 by paparazzo · edited Sep 24 '13 at 20:24

Why do you think that the comparison is not O(1)? – Guffa Sep 24 '13 at 19:47
@Guffa I am not sure it is not and I removed my comment. And +1. I should have done more testing before making the comment. – paparazzo Sep 24 '13 at 20:16

score 1 · Answer 21 · edited Apr 28 '10 at 03:16

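public string RemoveSpecial(string evalstr)
{
    StringBuilder finalstr = new StringBuilder();
    foreach (char c in evalstr)
    {
        // Implicit char-to-int conversion; Convert.ToInt16 throws for chars above 0x7FFF.
        int charascii = c;
        // Drops only ASCII 33-47 ('!' through '/'); note this range includes '.' (46).
        if (!(charascii >= 33 && charascii <= 47))
            finalstr.Append(c); // Append, not append: C# method names are case-sensitive
    }
    return finalstr.ToString();
}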

edited Apr 28 '10 at 03:16

sth

answered Apr 27 '10 at 17:19

Shiko

score 0 · Answer 22 · answered Aug 25 '15 at 00:16

I'm not sure it is the most efficient way, but it works for me (VB.NET; it removes diacritical marks such as tildes and accents):

Public Function RemoverTildes(stIn As String) As String
    Dim stFormD As String = stIn.Normalize(NormalizationForm.FormD)
    Dim sb As New StringBuilder()

    For ich As Integer = 0 To stFormD.Length - 1
        Dim uc As UnicodeCategory = CharUnicodeInfo.GetUnicodeCategory(stFormD(ich))
        If uc <> UnicodeCategory.NonSpacingMark Then
            sb.Append(stFormD(ich))
        End If
    Next
    Return (sb.ToString().Normalize(NormalizationForm.FormC))
End Function

The answer _does_ work, but the question was for **C#.** (P.S: I know this was practically five years ago, but still..) I used the Telerik VB to C# Converter, (And vice-versa) and the code worked just fine - not sure about anyone else, though. (Another thing, https://converter.telerik.com/) — Momoro, Apr 18 '20 at 03:29

score 0 · Answer 23 · answered May 19 '21 at 16:00

The shortest way, just three lines:

public static string RemoveSpecialCharacters(string str)
{
    var sb = new StringBuilder();
    foreach (var c in str.Where(c => c >= '0' && c <= '9' || c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z' || c == '.' || c == '_')) sb.Append(c); 
    return sb.ToString();
}

Mykola Uspalenko · Answer 24 · 2022-06-27T22:55:08.567

If you need to clean up the input string in case of injections or typos (rare events), the fastest way is to use a switch to check every character (the compiler does a good job of optimizing switch statements), plus additional code that removes unwanted characters only if any are found. Here is the solution:

    public static string RemoveExtraCharacters(string input)
    {
        if (string.IsNullOrEmpty(input))
            return "";

        input = input.Trim();

        StringBuilder sb = null;

    reStart:
        if (!string.IsNullOrEmpty(input))
        {
            var len = input.Length;

            for (int i = 0; i < len; i++)
            {
                switch (input[i])
                {
                    case '0':
                    case '1':
                    case '2':
                    case '3':
                    case '4':
                    case '5':
                    case '6':
                    case '7':
                    case '8':
                    case '9':
                    case 'A':
                    case 'B':
                    case 'C':
                    case 'D':
                    case 'E':
                    case 'F':
                    case 'G':
                    case 'H':
                    case 'I':
                    case 'J':
                    case 'K':
                    case 'L':
                    case 'M':
                    case 'N':
                    case 'O':
                    case 'Q':
                    case 'P':
                    case 'R':
                    case 'S':
                    case 'T':
                    case 'U':
                    case 'V':
                    case 'W':
                    case 'X':
                    case 'Y':
                    case 'Z':
                    case 'a':
                    case 'b':
                    case 'c':
                    case 'd':
                    case 'e':
                    case 'f':
                    case 'g':
                    case 'h':
                    case 'i':
                    case 'j':
                    case 'k':
                    case 'l':
                    case 'm':
                    case 'n':
                    case 'o':
                    case 'q':
                    case 'p':
                    case 'r':
                    case 's':
                    case 't':
                    case 'u':
                    case 'v':
                    case 'w':
                    case 'x':
                    case 'y':
                    case 'z':
                    case '/':
                    case '_':
                    case '-':
                    case '+':
                    case '.':
                    case ',':
                    case '*':
                    case ':':
                    case '=':
                    case ' ':
                    case '^':
                    case '$':
                        break;  

                    default:
                        if (sb == null)
                            sb = new StringBuilder();

                        sb.Append(input.Substring(0, i));
                        if (i + 1 < len)
                        {
                            input = input.Substring(i + 1);
                            goto reStart;
                        }
                        else
                            input = null;
                        break;
                }
            }
        }

        if (sb != null)
        {
            if (input != null)
                sb.Append(input);
            return sb.ToString();
        }

        return input;
    }

score -1 · Answer 25 · edited Jun 13 '20 at 11:59

public static string RemoveAllSpecialCharacters(this string text) {
  if (string.IsNullOrEmpty(text))
    return text;

  string result = Regex.Replace(text, "[:!@#$%^&*()}{|\":?><\\[\\]\\;'/.,~]", " ");
  return result;
}

edited Jun 13 '20 at 11:59

kyun

answered Jun 13 '20 at 11:26

Hasan_H

The answer is wrong. If you are going to use a regex, it should be inclusive (a whitelist), not exclusive, because this pattern misses some characters. There is also already an answer that uses a regex. And for completeness: a regex is SLOWER than a function that compares chars directly. – TPAKTOPA Jun 13 '20 at 13:26
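
As a sketch of what the "inclusive" variant suggested in this comment could look like: the pattern below lists the characters to keep and removes everything else, instead of trying to enumerate every unwanted character. The class name `RegexCleaner` is made up for this example, and the allowed set is assumed to be the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // Whitelist: any character that is NOT A-Z, a-z, 0-9, '_' or '.' matches and is removed.
    private static readonly Regex NotAllowed =
        new Regex("[^A-Za-z0-9_.]", RegexOptions.Compiled);

    public static string RemoveAllSpecialCharacters(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        // Replace with the empty string so characters are removed, not turned into spaces.
        return NotAllowed.Replace(text, string.Empty);
    }
}
```

This addresses the correctness concern only; the comment's point that a regex is slower than direct character comparison still applies.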

score -2 · Answer 26 · answered Sep 11 '21 at 09:15

Simple way with LINQ

string text = "123a22 ";
var newText = String.Join(string.Empty, text.Where(x => x != 'a'));

answered Sep 11 '21 at 09:15

Akbar Asghari

Triynko · Answer 27 · 2009-07-13T16:26:01.127

If you're worried about speed, use pointers to edit the existing string. You could pin the string and get a pointer to it, then run a for loop over each character, overwriting each invalid character with a replacement character. It would be extremely efficient and would not require allocating any new string memory. You would also need to compile your module with the unsafe option, and add the "unsafe" modifier to your method header in order to use pointers.

static void Main(string[] args)
{
    string str = "string!$%with^&*invalid!!characters";
    Console.WriteLine( str ); //print original string
    FixMyString( str, ' ' );
    Console.WriteLine( str ); //print string again to verify that it has been modified
    Console.ReadLine(); //pause to leave command prompt open
}


public static unsafe void FixMyString( string str, char replacement_char )
{
    fixed (char* p_str = str)
    {
        char* c = p_str; //temp pointer, since p_str is read-only
        for (int i = 0; i < str.Length; i++, c++) //loop through each character in string, advancing the character pointer as well
            if (!IsValidChar(*c)) //check whether the current character is invalid
                (*c) = replacement_char; //overwrite character in existing string with replacement character
    }
}

public static bool IsValidChar( char c )
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c == '.' || c == '_');
    //return char.IsLetterOrDigit( c ) || c == '.' || c == '_'; //this may work as well
}

Noooooooooo! Changing a string in .NET is BAAAAAAAAAAAAD! Everything in the framework relies on the rule that strings are immutable, and if you break that you can get very surprising side effects... — Guffa, Jul 13 '09 at 16:52
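
For comparison, a minimal sketch of the same replace-in-place idea that respects string immutability: it works on a private char[] copy and returns a new string, so the caller's string is never modified. The names `SafeCleaner` and `ReplaceInvalid` are made up for this example; `IsValidChar` repeats the check from the answer above. It does allocate (one array and one string), which is the cost of not using unsafe code.

```csharp
using System;

public static class SafeCleaner
{
    // Hypothetical alternative to FixMyString: returns a new string, input untouched.
    public static string ReplaceInvalid(string str, char replacementChar)
    {
        char[] buffer = str.ToCharArray(); // private copy, safe to mutate
        for (int i = 0; i < buffer.Length; i++)
        {
            if (!IsValidChar(buffer[i]))
                buffer[i] = replacementChar; // overwrite in the copy only
        }
        return new string(buffer);
    }

    private static bool IsValidChar(char c)
    {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z') || c == '.' || c == '_';
    }
}
```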

score -3 · Answer 28 · edited Dec 18 '12 at 14:33

public static string RemoveSpecialCharacters(string str){
    return str.replaceAll("[^A-Za-z0-9_.]", "");
}

edited Dec 18 '12 at 14:33

Rory McCrossan

answered Dec 18 '12 at 14:14

Jawaid

I'm afraid `replaceAll` is not a C# String function; it comes from either Java or JavaScript. – Csaba Toth Sep 27 '13 at 18:38
