C# remove carriage returns, line breaks and whitespaces from string as efficient as possible (benchmark)

Question

In C# I have a String containing Whitespaces, carriage returns and/or line breaks. Is there a simple way to normalize large strings (100.000 to 1.000.000 characters) which are imported from textfiles as efficient as possible?

To clarify what I mean: Let's say my string looks like string1 but I want it to be like string2

string1 = " ab c\r\n de.\nf";
string2 = "abcde.f";

@MikeBeaton Yes I know, but thanks for the info and your answer, which helped me for another problem I needed to solve aswell ;-) — Daniel, Jun 26 '20 at 15:08
It matters a lot how long your strings actually are. I did some benchmarking and around the 10_000 characters mark a parallel method started outperforming the `NewString` method, once reaching the 100_000_000 characters mark the parallel version was a little over 15x as fast (also benchmarked with BenchmarkDotNet). — , Jun 26 '20 at 16:20
@Knoop Do you want me to specify this in the question? The strings I have are between 10 000 and 100 000 characters long. — Daniel, Jun 26 '20 at 16:22
With those lengths you have a good chance a Parallel method will out perform the accepted answer, but it also depends on the hardware it runs on. Optimization and efficiency problems are seldom straight forward. So if the accepted answer is fast enough I would just go with that (also from the point of view that it's best to avoid premature optimization) — , Jun 26 '20 at 17:35

Guru Stron · Accepted Answer · 2020-06-26T15:29:59.597

The term "efficiently" can heavily depend on your actual strings and number of them. I've come up with next benchmark (for BenchmarkDotNet) :

public class Replace
{
    private static readonly string S = " ab c\r\n de.\nf";
    private static readonly Regex Reg = new Regex(@"\s+", RegexOptions.Compiled);

    [Benchmark]
    public string SimpleReplace() => S
       .Replace(" ","")
       .Replace("\\r","")
       .Replace("\\n","");

    [Benchmark]
    public string StringBuilder() => new StringBuilder().Append(S)
       .Replace(" ","")
       .Replace("\\r","")
       .Replace("\\n","")
       .ToString();

    [Benchmark]
    public string RegexReplace() => Reg.Replace(S, "");

    [Benchmark]
    public string NewString()
    {
            var arr = new char[S.Length];
            var cnt = 0;
            for (int i = 0; i < S.Length; i++)
            {
                switch(S[i])
                {
                    case ' ':
                    case '\r':
                    case '\n':
                        break;

                    default:
                        arr[cnt] = S[i];
                        cnt++;
                        break;
                }
            }

            return new string(arr, 0, cnt);
    }

    [Benchmark]
    public string NewStringForeach()
    {
        var validCharacters = new char[S.Length];
        var next = 0;

        foreach(var c in S)
        {
            switch(c)
            {
                case ' ':
                case '\r':
                case '\n':
                    // Ignore then
                    break;

                default:
                    validCharacters[next++] = c;
                    break;
            }
        }

        return new string(validCharacters, 0, next);
    }
}

This gives on my machine:

|          Method |        Mean |     Error |    StdDev |
|---------------- |------------:|----------:|----------:|
|   SimpleReplace |   122.09 ns |  1.273 ns |  1.063 ns |
|   StringBuilder |   311.28 ns |  6.313 ns |  8.850 ns |
|    RegexReplace | 1,194.91 ns | 23.376 ns | 34.265 ns |
|       NewString |    52.26 ns |  1.122 ns |  1.812 ns |
|NewStringForeach |    40.04 ns |  0.877 ns |  1.979 ns |

Thanks for your awesome answer! If I interpret this in the right way this means that the approach of @Sean is the quickest so far. Am I right? — Daniel, Jun 26 '20 at 14:29
@Daniel yep. But again, you should test against actual workload. — Guru Stron, Jun 26 '20 at 14:32
Since OP has stated that the desired optimization is specifically for large strings imo a good benchmark should represent that. Some solutions might have more startup overhead but be faster on a per character basis, making them lose out on short strings but come out ahead on very long strings. — , Jun 26 '20 at 14:45
@Knoop term "large" is very broad that is why i suggested to test(bench) against actual workload (and actual hardware). Potentially even frequency of specific characters can affect performance of one or another solution. — Guru Stron, Jun 26 '20 at 14:50
@GuruStron, since we have .net core 3.1, could you add foreach version in test result? You can copy it from here: https://dotnetfiddle.net/Gr6knH — sTrenat, Jun 26 '20 at 15:17

score 1 · Answer 2 · answered Jun 26 '20 at 14:09

1

To do this efficiently you want to avoid regex and keep memory allocations to a minimum: Here I've used a raw character buffer (rather than a StringBuilder) and for rather than foreach to optimize access to each character:

string Strip(string text)
{
    var validCharacters = new char[text.Length];
    var next = 0;

    for(int i = 0; i < text.Length; i++)
    {
        char c = text[i];

        switch(c)
        {
            case ' ':
            case '\r':
            case '\n':
                // Ignore then
                break;

            default:
                validCharacters[next++] = c;
                break;
        }
    }

    return new string(validCharacters, 0, next);
}

answered Jun 26 '20 at 14:09

Sean

60,939
11
97
136

Perhaps better to use something like char.IsWhitespace and char.IsControl since I'm assuming they want all white space/non printable removed – pinkfloydx33 Jun 26 '20 at 14:13
I cannot agree that `for` will be more efficient. I would even say that `for` will be slower due to optimalisation that compiler can do with `foreach` – sTrenat Jun 26 '20 at 14:28
1

@sTrenat - foreach will involve a call to get an an enumerator (probably a struct) and then a call `MoveNext` for each iteration and a call to `Current` to get the actual value. The implementation is then just grabbing the character from the specified index. All of this might get optimized away, but it might not. – Sean Jun 26 '20 at 14:31
I think you underestimate what compiler optimalisation can do. Check it yourself: https://dotnetfiddle.net/Gr6knH – sTrenat Jun 26 '20 at 14:36
@sTrenat - Thanks for the link. However, if you swap the calls to `Benchmark` around to that you do `StripForEach` first, followed by `Strip` then you'll find that `Strip` runs faster! : https://dotnetfiddle.net/BtrSXt – Sean Jun 26 '20 at 14:39
1

@sTrenat switch them around and the results are reversed in favor of `for`, check it yourself: https://dotnetfiddle.net/M12Lwh. So in other words, not a good benchmark. Do you have any actual theory/documentation to back this up? Last I know the optimizing the compiler actually does in the case of foreach on an array is iterating instead of using an enumerator, making it almost as fast as for – Jun 26 '20 at 14:40
You'r right, it seems that first call is always slower on this bench. But still seems like neither loop is faster. – sTrenat Jun 26 '20 at 15:01
When I run it with BenchmarkDotNet, result was: `Strip | 36.42 ns | 0.389 ns | 0.363 ns` `StripForeach | 27.83 ns | 0.488 ns | 0.457 ns`, seems like still `Foreach` is better – sTrenat Jun 26 '20 at 15:13

score 0 · Answer 3 · answered Jun 26 '20 at 13:56

0

var input = " ab c\r\n de.\nf";
var result = Regex.Replace(input, @"\s+", "");

// result is now "abcde.f"

You can see it in action here

answered Jun 26 '20 at 13:56

Blindy

65,249
10
91
131

Thanks for your answer! But is using regular Expressions the most efficient way? – Daniel Jun 26 '20 at 13:58
1

It's efficient enough! – Blindy Jun 26 '20 at 14:00

score 0 · Answer 4 · answered Jun 26 '20 at 13:57

0

You can do like this. You can define which special characters you want to allow in a config file. In my case I have defined in appsettings.json file.

private string RemoveUnnecessaryChars(string firstName)
{
    StringBuilder sb = new StringBuilder();
    string allowedCharacters = _configuration["AllowedChars"];
    foreach (char ch in firstName)
    {
        if (char.IsLetterOrDigit(ch))
        {
            sb.Append(ch);
        }
        else
        {
            if (allowedCharacters.Contains(ch))
            {
                sb.Append(ch);
            }
        }
    }

    return sb.ToString();
}

answered Jun 26 '20 at 13:57

Vivek Nuna

25,472
25
109
197

Thanks for your answer, however I hoped to find a more simple way to do this. – Daniel Jun 26 '20 at 14:07
4

Probably better to make a HashSet out of allowedCharacters (depending on what's there) but also probably better to define a black list rather than a whitelist – pinkfloydx33 Jun 26 '20 at 14:10
@pinkfloydx33 thanks, that is a valid point – Vivek Nuna Jun 26 '20 at 14:17

C# remove carriage returns, line breaks and whitespaces from string as efficient as possible (benchmark)

4 Answers4