C#: Removing common invalid characters from a string: improve this algorithm

Question

Consider the requirement to strip invalid characters from a string. The characters just need to be removed and replace with blank or string.Empty.

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example

foreach (char bad in BAD_CHARS)
{
    if (someString.Contains(bad))
      someString = someString.Replace(bad.ToString(), string.Empty);
}

I'd have really liked to do this:

if (BAD_CHARS.Any(bc => someString.Contains(bc)))
    someString.Replace(bc,string.Empty); // bc is out of scope

Question: Do you have any suggestions on refactoring this algoritm, or any simpler, easier to read, performant, maintainable algorithms?

I think the first one is simple, easy to read, and maintainable. — Sean Bright, Aug 25 '09 at 18:11
The first example has been fixed now, but the check for `Contains` is still redundant, and even then the algorithm isn't very efficient. — Noldorin, Aug 25 '09 at 18:18
Thanks Noldorin, appreciate your comments. In fact, this question is all about improving the algorithm. Thanks for the comments on its efficiencies! — p.campbell, Aug 25 '09 at 20:11

CAbbott · Answer 1 · 2009-08-25T18:34:10.743

37

I don't know about the readability of it, but a regular expression could do what you need it to:

someString = Regex.Replace(someString, @"[!@#$%_]", "");

edited Aug 25 '09 at 18:34

answered Aug 25 '09 at 18:23

CAbbott

8,078
4
31
38

1

The regex should probably be either "[!@#$%_]" or "!|@|#|$|%|_". – Daniel Brückner Aug 25 '09 at 18:31
Is it faster than Noldorin's solution? – CasperT Aug 25 '09 at 18:32
@CaspterT: Afraid not - it's going to be `O(nm)` at best, since it uses a state machine. Even compiled regex will naturally be slower. Also, not so practical for a large number of chars. – Noldorin Aug 25 '09 at 18:37
@Noldorin - you're right. I was coming at it from a simplicity, elegance standpoint rather than pure performance one. – CAbbott Aug 25 '09 at 18:41
2

@Noldorin: I disagree. The assemblying of a new string will happen only once using the regex, and many times using a lot of Replace calls. Assemblying a new string implies allocating memory (= expensive), and Regex.Replace has better cache locality (= cheaper) – Massa Aug 25 '09 at 18:51
Now I would love to see some benchmarks. Replacing or removing chars in a string is common practice! – CasperT Aug 25 '09 at 19:03
@Casper, Noldorin: Regular expressions are a proven technology for string parsing and manipulation, and the technology has evolved into something VERY fast. While this solution isn't as fast as Noldorin's bare-metal solution, it is a great deal more elegant (IMO), and is considerably faster than the constant string reconstruction of the OP's question. – Adam Robinson Aug 25 '09 at 19:55
I could recommend to use static method Regex.Replace(input, @"[!@#$%_]", output, RegexOptions.Compiled) because it supports RegexOptions which increases the speed – abatishchev Aug 25 '09 at 21:50
@abatishchev: It is still slower than any of the List/Hash/bool[] solutions. I do agree it is more readable. – Erich Mirabal Aug 25 '09 at 21:55
@CAbbott: Fair enough then. I'm not sure I agree it's more elegant, though it's probably simpler from most perspectives. – Noldorin Aug 25 '09 at 22:53
1

@CasperT here are some bench marks (avg. over 1,000,000 iterations in milliseconds) Noldorin: 0,0072813432 (HashSet) RuneFS: 0,0030156636 (Split and concatenation) 28OZ28: 0,0091563672 (Hashing) CAbbot: 0,01132827 (RegEX) – Rune FS Aug 26 '09 at 07:30
1

Here are my benchmarks. Running over 2 Million iterations, using Stopwatch to get millisecond timings: List = 7337. HashSet = 4892. bool[] = 2940. BitArray = 3105. RegEx (compiled) = 8700. Smaller is better (obviously), and bool[] was the fastest (if not the most memory hungry). BitArray gives a very nice trade-off between speed and size. – Erich Mirabal Aug 26 '09 at 12:24
Forgot about the original Replace implementation. That one, using the same conditions as the others, took 4851 milliseconds. So faster than the List, edges out the HashSet, but still slower than the bool/BitArray options. – Erich Mirabal Aug 26 '09 at 12:35

Rune FS · Accepted Answer · 2016-07-15T08:17:26.313

23

char[] BAD_CHARS = new char[] { '!', '@', '#', '$', '%', '_' }; //simple example
someString = string.Concat(someString.Split(BAD_CHARS,StringSplitOptions.RemoveEmptyEntries));

should do the trick (sorry for any smaller syntax errors I'm on my phone)

edited Jul 15 '16 at 08:17

answered Aug 25 '09 at 19:12

Rune FS

21,497
7
62
96

StringSplitOption should be StringSplitOptions – itchi Jul 14 '16 at 18:05

score 18 · Answer 3 · edited Aug 26 '09 at 00:32

18

The string class is immutable (although a reference type), hence all its static methods are designed to return a new string variable. Calling someString.Replace without assigning it to anything will not have any effect in your program. - Seems like you fixed this problem.

The main issue with your suggested algorithm is that it repeatedly assigning many new string variables, potentially causing a big performance hit. LINQ doesn't really help things here. (I doesn't make the code significantly shorter and certainly not any more readable, in my opinion.)

Try the following extension method. The key is the use of StringBuilder, which means only one block of memory is assigned for the result during execution.

private static readonly HashSet<char> badChars = 
    new HashSet<char> { '!', '@', '#', '$', '%', '_' };

public static string CleanString(this string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!badChars.Contains(str[i]))
            result.Append(str[i]);
    }
    return result.ToString();
}

This algorithm also makes use of the .NET 3.5 'HashSet' class to give O(1) look up time for detecting a bad char. This makes the overall algorithm O(n) rather than the O(nm) of your posted one (m being the number of bad chars); it also is lot a better with memory usage, as explained above.

edited Aug 26 '09 at 00:32

LukeH

263,068
57
365
409

answered Aug 25 '09 at 18:16

Noldorin

144,213
56
264
302

4

If you use a hashtable/dictionary to store the bad characters, the lookup would be O(1) instead of O(m). That could also allow for a custom replacement of the character (i.e. if he wanted to replace '@' with 'at' instead of '' in the future). – Erich Mirabal Aug 25 '09 at 18:21
If you're going to go to the trouble to obfuscate the code in this manner, why not just use a RegEx? – Adam Robinson Aug 25 '09 at 18:33
1

@Adam: Obfuscated - how so? To me this code is both perfectly clear and very efficient. See my comment on the other post for why not to use RegEx. – Noldorin Aug 25 '09 at 18:38
1

Also. Just a small comment. You have to use str.Length in your loop - not the stringbuilder :) it doesn't work that way – CasperT Aug 25 '09 at 19:06
That an algorithm runs in O(nm) does _not_ state that it will always be slower than an algorithm running in O(1) it _only_ states that the execution time of one is affected by the number of two factors (in this case elements in to collections) wheras the other is unaffected by any factors. in this case the O(nm) will run faster than the O(1) for a low nm. – Rune FS Aug 25 '09 at 19:17
1

@Rune FS: I'd like to see your numbers that give you that result. I just tried it with both the list and hashset implementation, and the hashset was always faster (it took 1/2 to 2/3 of the time of the list). My `nm` was the base string he gave (i.e. both the input and badchar were the same size -- 6 characters long) so it was very low as you claim. – Erich Mirabal Aug 25 '09 at 19:40
(@Rune FS: as a side note, yes, I agree with you that O(1) can be slower than O(nm) for certain sizes of nm, just not in this case). – Erich Mirabal Aug 25 '09 at 19:43
I agree with Erich fully. Rune FS: it should be obvious to you that this algorithm is quicker than the `O(nm)` alternative. Any other reason for the down-vote? :P – Noldorin Aug 25 '09 at 22:50
@Noldorin and Erich. First of all any down votes was note from me I up voted cuz I like your solution for optimizing the hashing however the solution I've proposed below measured over 1000000 iterations ran in 27 units of time wheras yours ran in 43 on my machine (excluding the hashing initialisation which bloated it to a factor 10 for one iteration). The split solution is O(nm) – Rune FS Aug 26 '09 at 06:45
@myself the above numbers are incorrect the 43 is for the optimized hashing version by28OZ28 the solution in this answer ran in 85 – Rune FS Aug 26 '09 at 07:06
@Rune FS: Fair enough. Not sure exactly *why* yours runs quicker, though. (Perhaps the internal optimisation of the static `string` methods?) – Noldorin Aug 26 '09 at 10:14
There seems no point adding a separate answer (especially since the accepted one is frankly much worse). Have you considered noting that the hash can be replaced with a simple switch based static method given the rules are compile time known... – ShuggyCoUk Sep 01 '09 at 00:15

score 7 · Answer 4 · edited May 23 '17 at 12:18

7

This one is faster than HashSet<T>. Also, if you have to perform this action often, please consider the foundations for this question I asked here.

private static readonly bool[] BadCharValues;

static StaticConstructor()
{
    BadCharValues = new bool[char.MaxValue+1];
    char[] badChars = { '!', '@', '#', '$', '%', '_' };
    foreach (char c in badChars)
        BadCharValues[c] = true;
}

public static string CleanString(string str)
{
    var result = new StringBuilder(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        if (!BadCharValues[str[i]])
            result.Append(str[i]);
    }
    return result.ToString();
}

edited May 23 '17 at 12:18

Community

1
1

answered Aug 25 '09 at 20:50

Sam Harwell

97,721
20
209
280

You basically created an expanded version of the hashset where you are guaranteed no collisions and no need to compute a hash value since the hash value is the character value. +1 since it is faster without paying a big memory penalty. – Erich Mirabal Aug 25 '09 at 21:36
(for the curious: this is still faster than using a BitArray, even though bool[] takes up a bit (ahem) more space) – Erich Mirabal Aug 25 '09 at 21:48
1

The memory penalty is significant, but indeed this uses an optimised version of the `HashSet`. Oh, and thanks for copying my code without attribute. – Noldorin Aug 26 '09 at 10:12
a 65K lookup table with only 6 entries of note will almost certainly perform worse than a hash based approach due to the memory pressure on your cache. If it's compile time known write it as a switch statement and be done with it... – ShuggyCoUk Sep 01 '09 at 00:12

score 4 · Answer 5 · edited Nov 08 '18 at 22:33

4

if you still want to do it in a LINQy way:

public static string CleanUp(this string orig)
{
    var badchars = new HashSet<char>() { '!', '@', '#', '$', '%', '_' };

    return new string(orig.Where(c => !badchars.Contains(c)).ToArray());
}

edited Nov 08 '18 at 22:33

Ben Gripka

16,012
6
45
41

answered Aug 25 '09 at 18:38

Mike Jacobs

509
3
4

You don't need the ToCharArray at all :) Remove it and the code still works as it should. – CasperT Aug 25 '09 at 19:10
And you can use a HashSet instead of a List. – Robert Rossney Aug 25 '09 at 21:16

score 4 · Answer 6 · answered Apr 29 '11 at 23:17

Extra tip: If you don't want to remember the array of char that are invalid for Files, you could use Path.GetInvalidFileNameChars(). If you wanted it for Paths, it's Path.GetInvalidPathChars

private static string RemoveInvalidChars(string str)
            {
                return string.Concat(str.Split(Path.GetInvalidFileNameChars(), StringSplitOptions.RemoveEmptyEntries));
            }

score 3 · Answer 7 · edited Aug 25 '09 at 20:35

Why would you have REALLY LIKED to do that? The code is absolutely no simpler, you're just forcing a query extension method into your code.

As an aside, the Contains check seems redundant, both conceptually and from a performance perspective. Contains has to run through the whole string anyway, you may as well just call Replace(bad.ToString(), string.Empty) for every character and forget about whether or not it's actually present.

Of course, a regular expression is always an option, and may be more performant (if not less clear) in a situation like this.

score 3 · Answer 8 · answered Aug 25 '09 at 18:27

Something to consider -- if this is for passwords (say), you want to scan for and keep good characters, and assume everything else is bad. Its easier to correctly filter or good things, then try to guess all bad things.

For Each Character If Character is Good -> Keep it (copy to out buffer, whatever.)

jeff

Jason Kleban · Answer 9 · 2009-08-31T23:56:08.017

3

This is pretty clean. Restricts it to valid characters instead of removing invalid ones. You should split it to constants probably:

string clean = new string(@"Sour!ce Str&*(@ing".Where(c => 
@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.".Contains(c)).ToArray()

edited Aug 31 '09 at 23:56

answered Aug 25 '09 at 21:42

Jason Kleban

20,024
18
75
125

C#: Removing common invalid characters from a string: improve this algorithm

9 Answers9

Linked