113

Let's say I have a string such as:

"Hello     how are   you           doing?"

I would like a function that turns multiple spaces into one space.

So I would get:

"Hello how are you doing?"

I know I could use regex or call

string s = "Hello     how are   you           doing?".Replace("  ", " ");

But I would have to call it multiple times to make sure every run of whitespace is reduced to a single space.

Is there already a built-in method for this?

LarsTech
Matt

17 Answers

210
string cleanedString = System.Text.RegularExpressions.Regex.Replace(dirtyString,@"\s+"," ");
Frank van Puffelen
Tim Hoolihan
54

This question isn't as simple as other posters have made it out to be (and as I originally believed it to be), because the question isn't quite as precise as it needs to be.

There's a difference between "space" and "whitespace". If you only mean spaces, then you should use a regex of " {2,}". If you mean any whitespace, that's a different matter. Should all whitespace be converted to spaces? What should happen to spaces at the start and end?

For the benchmark below, I've assumed that you only care about spaces, and you don't want to do anything to single spaces, even at the start and end.

Note that correctness is almost always more important than performance. The Split/Join solution removes any leading/trailing whitespace (even just single spaces), which is incorrect as far as your specified requirements go (which may be incomplete, of course).
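To make the correctness point concrete, here is a minimal sketch that puts the two approaches side by side. The class and method names are made up for illustration; the regex and the Split/Join call mirror the ones used in the benchmark.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class SpaceVersusSplitJoin
{
    // Collapses runs of two or more U+0020 spaces into one space.
    // Tabs and single leading/trailing spaces are left untouched.
    public static string CollapseSpaces(string input) =>
        Regex.Replace(input, " {2,}", " ");

    // Split/Join on spaces: collapses runs too, but also drops
    // any leading and trailing spaces as a side effect.
    public static string SplitAndJoin(string input) =>
        string.Join(" ", input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
}
```

For the input " a  b\tc " (leading space, double space, tab, trailing space), CollapseSpaces returns " a b\tc " while SplitAndJoin returns "a b\tc", so the two only agree when the input has no leading or trailing spaces.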

The benchmark uses MiniBench.

using System;
using System.Text.RegularExpressions;
using MiniBench;

internal class Program
{
    public static void Main(string[] args)
    {

        int size = int.Parse(args[0]);
        int gapBetweenExtraSpaces = int.Parse(args[1]);

        char[] chars = new char[size];
        for (int i=0; i < size/2; i += 2)
        {
            // Make sure there actually *is* something to do
            chars[i*2] = (i % gapBetweenExtraSpaces == 1) ? ' ' : 'x';
            chars[i*2 + 1] = ' ';
        }
        // Just to make sure we don't have a \0 at the end
        // for odd sizes
        chars[chars.Length-1] = 'y';

        string bigString = new string(chars);
        // Assume that one form works :)
        string normalized = NormalizeWithSplitAndJoin(bigString);


        var suite = new TestSuite<string, string>("Normalize")
            .Plus(NormalizeWithSplitAndJoin)
            .Plus(NormalizeWithRegex)
            .RunTests(bigString, normalized);

        suite.Display(ResultColumns.All, suite.FindBest());
    }

    private static readonly Regex MultipleSpaces = 
        new Regex(@" {2,}", RegexOptions.Compiled);

    static string NormalizeWithRegex(string input)
    {
        return MultipleSpaces.Replace(input, " ");
    }

    // Guessing as the post doesn't specify what to use
    private static readonly char[] Whitespace =
        new char[] { ' ' };

    static string NormalizeWithSplitAndJoin(string input)
    {
        string[] split = input.Split
            (Whitespace, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", split);
    }
}

A few test runs:

c:\Users\Jon\Test>test 1000 50
============ Normalize ============
NormalizeWithSplitAndJoin  1159091 0:30.258 22.93
NormalizeWithRegex        26378882 0:30.025  1.00

c:\Users\Jon\Test>test 1000 5
============ Normalize ============
NormalizeWithSplitAndJoin  947540 0:30.013 1.07
NormalizeWithRegex        1003862 0:29.610 1.00


c:\Users\Jon\Test>test 1000 1001
============ Normalize ============
NormalizeWithSplitAndJoin  1156299 0:29.898 21.99
NormalizeWithRegex        23243802 0:27.335  1.00

Here the first number is the number of iterations, the second is the time taken, and the third is a scaled score with 1.0 being the best.

That shows that in at least some cases (including this one) a regular expression can outperform the Split/Join solution, sometimes by a very significant margin.

However, if you change to an "all whitespace" requirement, then Split/Join does appear to win. As is so often the case, the devil is in the detail...

Jon Skeet
  • 1
    Great analysis. So it appears that we were both correct to varying degrees. The code in my answer was taken from a larger function which has the ability to normalize all whitespace and/or control characters from within a string and from the beginning and end. – Scott Dorman Aug 15 '09 at 00:34
  • 1
    With just the whitespace characters you specified, in most of my tests the regex and Split/Join were about equal - S/J had a tiny, tiny benefit, at the cost of correctness and complexity. For those reasons, I'd normally prefer the regex. Don't get me wrong - I'm far from a regex fanboy, but I don't like writing more complex code for the sake of performance without really testing the performance first. – Jon Skeet Aug 15 '09 at 06:27
  • NormalizeWithSplitAndJoin will create a lot more garbage; it is hard to tell whether a real program would be hit by more GC time than the benchmark. – Ian Ringrose Dec 20 '13 at 14:54
  • @IanRingrose What sort of garbage can be created? – Dronz Apr 10 '18 at 16:46
19

A regular expression would be the easiest way. If you write the regex the correct way, you won't need multiple calls.

Change it to this:

s = System.Text.RegularExpressions.Regex.Replace(s, @"\s{2,}", " ");
Lars Truijens
Brandon
  • My one issue with `@"\s{2,}"` is that it fails to replace single tabs and other Unicode space characters with a space. If you are going to replace 2 tabs with a space, then you should probably replace 1 tab with a space. `@"\s+"` will do that for you. – David Specht Dec 11 '18 at 21:11
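The difference described in that comment can be sketched as follows. The class and method names are hypothetical; only the two patterns come from the thread.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class comparing the two quantifiers.
public static class QuantifierComparison
{
    // Only touches runs of at least two whitespace characters.
    public static string TwoOrMore(string input) =>
        Regex.Replace(input, @"\s{2,}", " ");

    // Touches every run of whitespace, including a single tab or newline.
    public static string OneOrMore(string input) =>
        Regex.Replace(input, @"\s+", " ");
}
```

With the input "a\tb", TwoOrMore leaves the tab in place and OneOrMore returns "a b". With "a\t\tb", both return "a b". Which one is right depends on whether a lone tab should survive.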
18

While the existing answers are fine, I'd like to point out one approach which doesn't work:

public static string DontUseThisToCollapseSpaces(string text)
{
    while (text.IndexOf("  ") != -1)
    {
        text = text.Replace("  ", " ");
    }
    return text;
}

This can loop forever. Anyone care to guess why? (I only came across this when it was asked as a newsgroup question a few years ago... someone actually ran into it as a problem.)

Jon Skeet
  • I think I remember this question being asked a while back on SO. IndexOf ignores certain characters that Replace doesn't. So the double space was always there, just never removed. – Brandon Aug 14 '09 at 20:08
  • 19
    It is because IndexOf ignores some Unicode characters, the specific culprit in this case being some Asian character, IIRC. Hmm, zero-width non-joiner according to Google. – ahawker Aug 14 '09 at 20:57
  • I learned that the hard way :( http://stackoverflow.com/questions/9260693/strange-results-from-indexof-on-german-string – Antonio Bakula Sep 21 '15 at 10:10
  • I learned the hard way, especially with two zero-width non-joiners (\u200C\u200C). IndexOf returns the index of this "double space", but Replace does not replace it. I think it is because, for IndexOf, you need to specify StringComparison (Ordinal) to make it behave the same as Replace. That way, neither of the two will locate "double spaces". More about StringComparison: https://learn.microsoft.com/en-us/dotnet/api/system.stringcomparison?view=netframework-4.8 – Martin Brabec Mar 02 '20 at 08:48
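Following the suggestion in the comments above, one way to keep the loop condition and the replacement in agreement is to make the search ordinal. This is only a sketch with a made-up class name; it relies on string.Replace(string, string) performing an ordinal search, which is what the .NET documentation describes.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class OrdinalCollapse
{
    public static string CollapseSpaces(string text)
    {
        // An ordinal IndexOf compares raw char values, the same way
        // string.Replace(string, string) does, so the loop only continues
        // while Replace still has a double space it can actually remove.
        while (text.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }
}
```

Each pass through the loop shortens the string, so it terminates. It still allocates a new string per pass, so a single regex call is likely the simpler choice in practice.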
5

Here is the solution I work with, without Regex or String.Split.

public static string TrimWhiteSpace(this string Value)
{
    StringBuilder sbOut = new StringBuilder();
    if (!string.IsNullOrEmpty(Value))
    {
        bool IsWhiteSpace = false;
        for (int i = 0; i < Value.Length; i++)
        {
            if (char.IsWhiteSpace(Value[i])) // Is the current char whitespace?
            {
                if (!IsWhiteSpace) // Was the previous char non-whitespace?
                {
                    sbOut.Append(Value[i]);
                    IsWhiteSpace = true;
                }
            }
            else
            {
                IsWhiteSpace = false;
                sbOut.Append(Value[i]);
            }
        }
    }
    return sbOut.ToString();
}

so you can:

string cleanedString = dirtyString.TrimWhiteSpace();
fubo
5

A fast extra whitespace remover by Felipe Machado. (Modified by RW for multi-space removal)

static string DuplicateWhiteSpaceRemover(string str)
{
    var len = str.Length;
    var src = str.ToCharArray();
    int dstIdx = 0;
    bool lastWasWS = false; //Added line
    for (int i = 0; i < len; i++)
    {
        var ch = src[i];
        switch (ch)
        {
            case '\u0020': //SPACE
            case '\u00A0': //NO-BREAK SPACE
            case '\u1680': //OGHAM SPACE MARK
            case '\u2000': // EN QUAD
            case '\u2001': //EM QUAD
            case '\u2002': //EN SPACE
            case '\u2003': //EM SPACE
            case '\u2004': //THREE-PER-EM SPACE
            case '\u2005': //FOUR-PER-EM SPACE
            case '\u2006': //SIX-PER-EM SPACE
            case '\u2007': //FIGURE SPACE
            case '\u2008': //PUNCTUATION SPACE
            case '\u2009': //THIN SPACE
            case '\u200A': //HAIR SPACE
            case '\u202F': //NARROW NO-BREAK SPACE
            case '\u205F': //MEDIUM MATHEMATICAL SPACE
            case '\u3000': //IDEOGRAPHIC SPACE
            case '\u2028': //LINE SEPARATOR
            case '\u2029': //PARAGRAPH SEPARATOR
            case '\u0009': //[ASCII Tab]
            case '\u000A': //[ASCII Line Feed]
            case '\u000B': //[ASCII Vertical Tab]
            case '\u000C': //[ASCII Form Feed]
            case '\u000D': //[ASCII Carriage Return]
            case '\u0085': //NEXT LINE
                if (lastWasWS == false) //Added line
                {
                    src[dstIdx++] = ' '; // Updated by Ryan
                    lastWasWS = true; //Added line
                }
                continue;
            default:
                lastWasWS = false; //Added line 
                src[dstIdx++] = ch;
                break;
        }
    }
    return new string(src, 0, dstIdx);
}

The benchmarks...

|                           | Time  |   TEST 1    |   TEST 2    |   TEST 3    |   TEST 4    |   TEST 5    |
| Function Name             |(ticks)| dup. spaces | spaces+tabs | spaces+CR/LF| " " -> " "  | " " -> " " |
|---------------------------|-------|-------------|-------------|-------------|-------------|-------------|
| SwitchStmtBuildSpaceOnly  |   5.2 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| InPlaceCharArraySpaceOnly |   5.6 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| DuplicateWhiteSpaceRemover|   7.0 |    PASS     |    PASS     |    PASS     |    PASS     |    PASS     |
| SingleSpacedTrim          |  11.8 |    PASS     |    PASS     |    PASS     |    FAIL     |    FAIL     |
| Fubo(StringBuilder)       |    13 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| User214147                |    19 |    PASS     |    PASS     |    PASS     |    FAIL     |    FAIL     | 
| RegExWithCompile          |    28 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| SwitchStmtBuild           |    34 |    PASS     |    FAIL     |    FAIL     |    PASS     |    PASS     |
| SplitAndJoinOnSpace       |    55 |    PASS     |    FAIL     |    FAIL     |    FAIL     |    FAIL     |
| RegExNoCompile            |   120 |    PASS     |    PASS     |    PASS     |    PASS     |    PASS     |
| RegExBrandon              |   137 |    PASS     |    FAIL     |    PASS     |    PASS     |    PASS     |

Benchmark notes: Release Mode, no-debugger attached, i7 processor, avg of 4 runs, only short strings tested

SwitchStmtBuildSpaceOnly by Felipe Machado 2015 and modified by Sunsetquest

InPlaceCharArraySpaceOnly by Felipe Machado 2015 and modified by Sunsetquest

SwitchStmtBuild by Felipe Machado 2015 and modified by Sunsetquest

SwitchStmtBuild2 by Felipe Machado 2015 and modified by Sunsetquest

SingleSpacedTrim by David S 2013

Fubo(StringBuilder) by fubo 2014

SplitAndJoinOnSpace by Jon Skeet 2009

RegExWithCompile by Jon Skeet 2009

User214147 by user214147

RegExBrandon by Brandon

RegExNoCompile by Tim Hoolihan

Benchmark code is on GitHub

SunsetQuest
  • 2
    Nice to see my article referenced here! (I'm Felipe Machado.) I'm about to update it using a proper benchmark tool called BenchmarkDotNet! I'll try to set up runs in all runtimes (now that we have .NET Core and the like)... – Loudenvier May 01 '19 at 17:21
  • 2
    @Loudenvier - Nice work on this. Yours was the quickest by almost 400%! .Net Core is like a free 150-200% performance boost. It's getting closer to c++ performance but much easier to code. Thanks for the comment. – SunsetQuest May 02 '19 at 05:18
  • 2
    This only does spaces, not other white space characters. Maybe you want char.IsWhiteSpace(ch) instead of src[i] == '\u0020'. I notice this has been edited by the community. Did they bork it up? – Evil Pigeon Aug 06 '20 at 05:44
4

As already pointed out, this is easily done with a regular expression. I'll just add that you might want to add a .Trim() to that to get rid of leading/trailing whitespace.
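A minimal sketch of that combination, assuming the \s+ pattern from the top answer (the class and method names are made up for illustration):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CollapseAndTrim
{
    // Collapse every run of whitespace to one space, then strip
    // whatever single space is left at either end.
    public static string Normalize(string input) =>
        Regex.Replace(input, @"\s+", " ").Trim();
}
```

For example, Normalize("  Hello   how are you  ") returns "Hello how are you".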

MAK
4

I'm sharing what I use, because it appears I've come up with something different. I've been using this for a while and it is fast enough for me. I'm not sure how it stacks up against the others. I use it in a delimited file writer and run large datatables through it one field at a time.

    public static string NormalizeWhiteSpace(string S)
    {
        string s = S.Trim();
        bool iswhite = false;
        int sLength = s.Length;
        StringBuilder sb = new StringBuilder(sLength);
        foreach(char c in s.ToCharArray())
        {
            if(Char.IsWhiteSpace(c))
            {
                if (iswhite)
                {
                    //Continuing whitespace ignore it.
                    continue;
                }
                else
                {
                    //New WhiteSpace

                    //Replace whitespace with a single space.
                    sb.Append(" ");
                    //Set iswhite to True and any following whitespace will be ignored
                    iswhite = true;
                }  
            }
            else
            {
                sb.Append(c.ToString());
                //reset iswhitespace to false
                iswhite = false;
            }
        }
        return sb.ToString();
    }
user214147
2

VB.NET

Linha.Split(" ").ToList().Where(Function(x) x <> "").ToArray

C#

Linha.Split(' ').ToList().Where(x => x != "").ToArray();

Enjoy the power of LINQ =D

Sebastian Hofmann
  • Exactly! To me this is the most elegant approach, too. So for the record, in C# that would be: `string.Join(" ", myString.Split(' ').Where(s => s != "").ToArray())` – Efrain Oct 19 '16 at 12:14
  • 1
    Minor improvement on the `Split` to catch all whitespace and remove the `Where` clause: `myString.Split(null as char[], StringSplitOptions.RemoveEmptyEntries)` – David Feb 18 '17 at 18:35
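Putting David's suggestion together with the Join from the earlier comment gives something like the sketch below. The class and method names are hypothetical. Passing a null char[] as the separator makes Split treat whitespace characters as the delimiters.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class SplitOnAnyWhitespace
{
    // Split on any whitespace, discard the empty entries produced by
    // consecutive delimiters, then rejoin the words with single spaces.
    // Leading and trailing whitespace disappears as a side effect.
    public static string Normalize(string input) =>
        string.Join(" ", input.Split(null as char[], StringSplitOptions.RemoveEmptyEntries));
}
```

For example, Normalize(" a \t b\n") returns "a b". Note that this trims the ends, which may or may not be what the question wants.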
2

Using the test program that Jon Skeet posted, I tried to see if I could get a hand-written loop to run faster.
I can beat NormalizeWithSplitAndJoin every time, but I only beat NormalizeWithRegex with inputs of 1000, 5.

static string NormalizeWithLoop(string input)
{
    StringBuilder output = new StringBuilder(input.Length);

    char lastChar = '*';  // anything other than a space
    for (int i = 0; i < input.Length; i++)
    {
        char thisChar = input[i];
        if (!(lastChar == ' ' && thisChar == ' '))
            output.Append(thisChar);

        lastChar = thisChar;
    }

    return output.ToString();
}

I have not looked at the machine code the jitter produces; however, I expect the problem is the time taken by the calls to StringBuilder.Append(), and doing much better would need unsafe code.

So Regex.Replace() is very fast and hard to beat!!

Ian Ringrose
1
Regex regex = new Regex(@"\W+");
string outputString = regex.Replace(inputString, " ");
Michael D.
  • This replaces all non-word characters with space. So it would also replace things like brackets and quotes etc, which might not be what you want. – Herman Oct 28 '15 at 09:35
0

Smallest solution:

var regExp=/\s+/g,
newString=oldString.replace(regExp,' ');
0

You can try this:

    /// <summary>
    /// Remove all extra spaces and tabs between words in the specified string!
    /// </summary>
    /// <param name="str">The specified string.</param>
    public static string RemoveExtraSpaces(string str)
    {
        str = str.Trim();
        StringBuilder sb = new StringBuilder();
        bool space = false;
        foreach (char c in str)
        {
            // char.IsWhiteSpace already covers tab, which is (char)9.
            if (char.IsWhiteSpace(c)) { space = true; }
            else { if (space) { sb.Append(' '); } sb.Append(c); space = false; }
        }
        return sb.ToString();
    }
LL99
0

Replacement groups provide a simpler approach to replacing a run of the same whitespace character with a single instance of that character:

    public static void WhiteSpaceReduce()
    {
        string t1 = "a b   c d";
        string t2 = "a b\n\nc\nd";

        Regex whiteReduce = new Regex(@"(?<firstWS>\s)(?<repeatedWS>\k<firstWS>+)");
        Console.WriteLine("{0}", t1);
        //Console.WriteLine("{0}", whiteReduce.Replace(t1, x => x.Value.Substring(0, 1))); 
        Console.WriteLine("{0}", whiteReduce.Replace(t1, @"${firstWS}"));
        Console.WriteLine("\nNext example ---------");
        Console.WriteLine("{0}", t2);
        Console.WriteLine("{0}", whiteReduce.Replace(t2, @"${firstWS}"));
        Console.WriteLine();
    }

Please notice that the second example keeps a single \n, while the accepted answer would replace the end of line with a space.

If you need to replace any combination of white space characters with the first one, just remove the back-reference \k from the pattern.
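One reading of that last suggestion, assumed here, is to swap the back-reference for a plain \s+ so that any mix of whitespace following the first character is dropped. The class and method names below are made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class MixedWhitespaceReduce
{
    // Capture the first whitespace character, then consume one or more
    // further whitespace characters of any kind after it.
    private static readonly Regex WhiteReduce = new Regex(@"(?<firstWS>\s)\s+");

    // Replace the whole run with just the captured first character.
    public static string Reduce(string input) =>
        WhiteReduce.Replace(input, "${firstWS}");
}
```

With this variant, Reduce("a \n\tb") returns "a b" and Reduce("a\n  b") returns "a\nb": the run is reduced to whichever whitespace character came first. A single whitespace character on its own is never matched, so it is left alone.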

Dan
0
string.Join(" ", s.Split(" ").Where(r => r != ""));
0

Let me share my solution, based on the solutions already posted plus some small changes. It works fast enough because of the inline (local) function and StringBuilder. And it does exactly what was asked: it "collapses" every whitespace sequence to a single space. It also trims the spaces at the beginning and the end.

    [Theory]
    [InlineData("Test", "Test")]
    [InlineData(" Test", "Test")]
    [InlineData("Test  ", "Test")]
    [InlineData("  Test  ", "Test")]
    [InlineData(" Test,   test ", "Test, test")]
    public void NormalizeWhiteSpace(string source, string expected)
    {
        Assert.Equal(expected, source.NormalizeWhiteSpace());
    }

    public static string NormalizeWhiteSpace(this string str)
    {
        if (string.IsNullOrWhiteSpace(str))
            return null;

        var sbOut = new StringBuilder();

        var isWhiteSpace = false;
        var isWhiteSpaceInBeginning = false;
        for (var i = 0; i < str.Length; i++)
        {
            if (IsWhitespace(str[i]))
            {
                isWhiteSpace = true;
                if (i == 0)
                    isWhiteSpaceInBeginning = true;
            }
            else
            {
                if (isWhiteSpace)
                {
                    if (!isWhiteSpaceInBeginning)
                        sbOut.Append(' ');

                    isWhiteSpaceInBeginning = false;
                    isWhiteSpace = false;
                }

                sbOut.Append(str[i]);
            }
        }

        return sbOut.ToString();

        static bool IsWhitespace(char ch)
        {
            switch (ch)
            {
                case '\u0020':
                case '\u00A0':
                case '\u1680':
                case '\u2000':
                case '\u2001':
                case '\u2002':
                case '\u2003':
                case '\u2004':
                case '\u2005':
                case '\u2006':
                case '\u2007':
                case '\u2008':
                case '\u2009':
                case '\u200A':
                case '\u202F':
                case '\u205F':
                case '\u3000':
                case '\u2028':
                case '\u2029':
                case '\u0009':
                case '\u000A':
                case '\u000B':
                case '\u000C':
                case '\u000D':
                case '\u0085':
                    return true;
            }

            return false;
        }
    }
TimeCoder
-1

There is no built-in way to do this. You can try this:

private static readonly char[] whitespace = new char[] { ' ', '\n', '\t', '\r', '\f', '\v' };
public static string Normalize(string source)
{
   return String.Join(" ", source.Split(whitespace, StringSplitOptions.RemoveEmptyEntries));
}

This will remove leading and trailing whitespace as well as collapse any internal whitespace to a single space. If you really only want to collapse spaces, then the solutions using a regular expression are better; otherwise this solution is better. (See the analysis done by Jon Skeet.)

Community
Scott Dorman
  • 7
    If the regular expression is compiled and cached, I'm not sure that has more overhead than splitting and joining, which could create *loads* of intermediate garbage strings. Have you done careful benchmarks of both approaches before assuming that your way is faster? – Jon Skeet Aug 14 '09 at 20:04
  • 1
    whitespace is undeclared here – Tim Hoolihan Aug 14 '09 at 20:06
  • 3
    Speaking of overhead, why on earth are you calling `source.ToCharArray()` and then throwing away the result? – Jon Skeet Aug 14 '09 at 20:13
  • 2
    *And* calling `ToCharArray()` on the result of string.Join, only to create a new string... wow, for that to be in a post complaining of overhead is just remarkable. -1. – Jon Skeet Aug 14 '09 at 20:15
  • 1
    Oh, and assuming `whitespace` is `new char[] { ' ' }`, this will give the wrong result if the input string starts or ends with a space. – Jon Skeet Aug 14 '09 at 20:19
  • No, I've not done benchmarks, but I know there is higher overhead for RegEx compared to the Split and Join. From what it looks like Split and Join either use character buffers, treat the string as an array of characters or go through unsafe code to do pointer manipulations. – Scott Dorman Aug 14 '09 at 20:20
  • grrr...copied from a larger example...updated to reflect the comments. – Scott Dorman Aug 14 '09 at 20:24
  • "Knowing" there's a higher overhead for regexes isn't nearly as good as proving it with benchmarks. I'm running benchmarks now, and will post results soon. – Jon Skeet Aug 14 '09 at 20:34