1

In C# I have a String containing Whitespaces, carriage returns and/or line breaks. Is there a simple way to normalize large strings (100.000 to 1.000.000 characters) which are imported from textfiles as efficient as possible?

To clarify what I mean: Let's say my string looks like string1 but I want it to be like string2

string1 = " ab c\r\n de.\nf";
string2 = "abcde.f";
Daniel
  • 398
  • 2
  • 16
  • 1
    @MikeBeaton Yes I know, but thanks for the info and your answer, which helped me for another problem I needed to solve aswell ;-) – Daniel Jun 26 '20 at 15:08
  • 1
    It matters a lot how long your strings actually are. I did some benchmarking and around the 10_000 characters mark a parallel method started outperforming the `NewString` method, once reaching the 100_000_000 characters mark the parallel version was a little over 15x as fast (also benchmarked with BenchmarkDotNet). –  Jun 26 '20 at 16:20
  • @Knoop Do you want me to specify this in the question? The strings I have are between 10 000 and 100 000 characters long. – Daniel Jun 26 '20 at 16:22
  • 1
    With those lengths you have a good chance a Parallel method will out perform the accepted answer, but it also depends on the hardware it runs on. Optimization and efficiency problems are seldom straight forward. So if the accepted answer is fast enough I would just go with that (also from the point of view that it's best to avoid premature optimization) –  Jun 26 '20 at 17:35

4 Answers4

5

The term "efficiently" can heavily depend on your actual strings and number of them. I've come up with next benchmark (for BenchmarkDotNet) :

public class Replace
{
    private static readonly string S = " ab c\r\n de.\nf";
    private static readonly Regex Reg = new Regex(@"\s+", RegexOptions.Compiled);

    [Benchmark]
    public string SimpleReplace() => S
       .Replace(" ","")
       .Replace("\\r","")
       .Replace("\\n","");

    [Benchmark]
    public string StringBuilder() => new StringBuilder().Append(S)
       .Replace(" ","")
       .Replace("\\r","")
       .Replace("\\n","")
       .ToString();

    [Benchmark]
    public string RegexReplace() => Reg.Replace(S, "");

    [Benchmark]
    public string NewString()
    {
            var arr = new char[S.Length];
            var cnt = 0;
            for (int i = 0; i < S.Length; i++)
            {
                switch(S[i])
                {
                    case ' ':
                    case '\r':
                    case '\n':
                        break;

                    default:
                        arr[cnt] = S[i];
                        cnt++;
                        break;
                }
            }

            return new string(arr, 0, cnt);
    }

    [Benchmark]
    public string NewStringForeach()
    {
        var validCharacters = new char[S.Length];
        var next = 0;

        foreach(var c in S)
        {
            switch(c)
            {
                case ' ':
                case '\r':
                case '\n':
                    // Ignore then
                    break;

                default:
                    validCharacters[next++] = c;
                    break;
            }
        }

        return new string(validCharacters, 0, next);
    }
} 

This gives on my machine:

|          Method |        Mean |     Error |    StdDev |
|---------------- |------------:|----------:|----------:|
|   SimpleReplace |   122.09 ns |  1.273 ns |  1.063 ns |
|   StringBuilder |   311.28 ns |  6.313 ns |  8.850 ns |
|    RegexReplace | 1,194.91 ns | 23.376 ns | 34.265 ns |
|       NewString |    52.26 ns |  1.122 ns |  1.812 ns |
|NewStringForeach |    40.04 ns |  0.877 ns |  1.979 ns |
Guru Stron
  • 102,774
  • 10
  • 95
  • 132
  • 1
    Thanks for your awesome answer! If I interpret this in the right way this means that the approach of @Sean is the quickest so far. Am I right? – Daniel Jun 26 '20 at 14:29
  • 1
    @Daniel yep. But again, you should test against actual workload. – Guru Stron Jun 26 '20 at 14:32
  • Since OP has stated that the desired optimization is specifically for large strings imo a good benchmark should represent that. Some solutions might have more startup overhead but be faster on a per character basis, making them lose out on short strings but come out ahead on very long strings. –  Jun 26 '20 at 14:45
  • @Knoop term "large" is very broad that is why i suggested to test(bench) against actual workload (and actual hardware). Potentially even frequency of specific characters can affect performance of one or another solution. – Guru Stron Jun 26 '20 at 14:50
  • @GuruStron, since we have .net core 3.1, could you add foreach version in test result? You can copy it from here: https://dotnetfiddle.net/Gr6knH – sTrenat Jun 26 '20 at 15:17
  • 1
    @sTrenat added as you requested. – Guru Stron Jun 26 '20 at 15:30
1

To do this efficiently you want to avoid regex and keep memory allocations to a minimum: Here I've used a raw character buffer (rather than a StringBuilder) and for rather than foreach to optimize access to each character:

string Strip(string text)
{
    var validCharacters = new char[text.Length];
    var next = 0;

    for(int i = 0; i < text.Length; i++)
    {
        char c = text[i];

        switch(c)
        {
            case ' ':
            case '\r':
            case '\n':
                // Ignore then
                break;

            default:
                validCharacters[next++] = c;
                break;
        }
    }

    return new string(validCharacters, 0, next);
}
Sean
  • 60,939
  • 11
  • 97
  • 136
  • Perhaps better to use something like char.IsWhitespace and char.IsControl since I'm assuming they want all white space/non printable removed – pinkfloydx33 Jun 26 '20 at 14:13
  • I cannot agree that `for` will be more efficient. I would even say that `for` will be slower due to optimalisation that compiler can do with `foreach` – sTrenat Jun 26 '20 at 14:28
  • 1
    @sTrenat - foreach will involve a call to get an an enumerator (probably a struct) and then a call `MoveNext` for each iteration and a call to `Current` to get the actual value. The implementation is then just grabbing the character from the specified index. All of this might get optimized away, but it might not. – Sean Jun 26 '20 at 14:31
  • I think you underestimate what compiler optimalisation can do. Check it yourself: https://dotnetfiddle.net/Gr6knH – sTrenat Jun 26 '20 at 14:36
  • @sTrenat - Thanks for the link. However, if you swap the calls to `Benchmark` around to that you do `StripForEach` first, followed by `Strip` then you'll find that `Strip` runs faster! : https://dotnetfiddle.net/BtrSXt – Sean Jun 26 '20 at 14:39
  • 1
    @sTrenat switch them around and the results are reversed in favor of `for`, check it yourself: https://dotnetfiddle.net/M12Lwh. So in other words, not a good benchmark. Do you have any actual theory/documentation to back this up? Last I know the optimizing the compiler actually does in the case of foreach on an array is iterating instead of using an enumerator, making it almost as fast as for –  Jun 26 '20 at 14:40
  • You'r right, it seems that first call is always slower on this bench. But still seems like neither loop is faster. – sTrenat Jun 26 '20 at 15:01
  • When I run it with BenchmarkDotNet, result was: `Strip | 36.42 ns | 0.389 ns | 0.363 ns` `StripForeach | 27.83 ns | 0.488 ns | 0.457 ns`, seems like still `Foreach` is better – sTrenat Jun 26 '20 at 15:13
0
var input = " ab c\r\n de.\nf";
var result = Regex.Replace(input, @"\s+", "");

// result is now "abcde.f"

You can see it in action here

Blindy
  • 65,249
  • 10
  • 91
  • 131
0

You can do like this. You can define which special characters you want to allow in a config file. In my case I have defined in appsettings.json file.

private string RemoveUnnecessaryChars(string firstName)
{
    StringBuilder sb = new StringBuilder();
    string allowedCharacters = _configuration["AllowedChars"];
    foreach (char ch in firstName)
    {
        if (char.IsLetterOrDigit(ch))
        {
            sb.Append(ch);
        }
        else
        {
            if (allowedCharacters.Contains(ch))
            {
                sb.Append(ch);
            }
        }
    }

    return sb.ToString();
}
Vivek Nuna
  • 25,472
  • 25
  • 109
  • 197