19

I need to split a string at all whitespace; the result should contain ONLY the words themselves.

How can I do this in vb.net?

Tabs, Newlines, etc. must all be split!

This has been bugging me for quite a while now, as the syntax highlighter I made completely ignores the first word of every line except the very first one.

Johannes Rudolph
Cyclone
  • 1
    See also this possible duplicate, which uses StringSplitOptions to remove the extra whitespace: http://stackoverflow.com/questions/6111298/best-way-to-specify-whitespace-in-a-string-split-operation – goodeye Feb 01 '16 at 01:24

7 Answers

30

String.Split() (no parameters) does split on all whitespace (including LF/CR)
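
A minimal VB.NET sketch of this approach, assuming the text to split arrives as a plain String. The helper name `SplitWords` and the sample input are made up for illustration. Passing a `Nothing` separator array selects the same split-on-whitespace behaviour as the parameterless call, and `StringSplitOptions.RemoveEmptyEntries` (suggested in the comments on this answer) drops the empty strings that appear between adjacent whitespace characters.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SplitDemo
    ' Hypothetical helper: returns only the words, with no empty entries.
    Function SplitWords(source As String) As String()
        ' A Nothing separator array means "split on every whitespace character".
        ' RemoveEmptyEntries drops the empty strings between adjacent separators.
        Return source.Split(DirectCast(Nothing, Char()), StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Sample input: single space, tab, double space and a CR/LF pair.
        Dim source As String = "Dim x" & vbTab & "As  Integer" & vbCrLf & "End"

        For Each word As String In SplitWords(source)
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

For the sample input, the loop prints the five words `Dim`, `x`, `As`, `Integer` and `End`, one per line.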

Jimmy
  • Why didn't they include that as an overload lol? Thanks so much! – Cyclone Oct 13 '09 at 21:31
  • 2
    because it resolves to the Split(params char[]) overload, with an empty array. The documentation for that overload mentions this behavior. – Jimmy Oct 13 '09 at 22:14
  • 2
    CAUTION: As Johannes Rudolph mentions in his answer, if there are multiple whitespace characters in a row, the result of String.Split will contain empty elements. That is why Rubens Farias's answer is superior. – ToolmakerSteve Sep 03 '14 at 00:32
  • Adam Ralph's solution of using String.Split().Where(...) has quicker performance than the Regex solution. I posted test results below. – u8it Sep 10 '15 at 20:14
  • 4
    @ToolMakerSteve - to remove empty elements `String.Split(new char[] {}, StringSplitOptions.RemoveEmptyEntries)` – Joe Jan 12 '18 at 11:52
  • @Joe, VS says "identifier expected" for me. – Prof. Falken Aug 30 '18 at 08:20
  • 2
    @Joe great idea, thanks! Simpler: `line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)` – Michel de Ruiter Mar 25 '20 at 11:46
21

Try this:

Regex.Split("your string here", "\s+")
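
One caveat worth sketching: `Regex.Split` returns an empty first or last element when the input begins or ends with a match, so leading or trailing whitespace still leaks an empty string into the result. The helper below is a hedged illustration, not part of the original answer. The name `SplitOnWhitespace` and the sample input are assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module RegexSplitDemo
    ' Hypothetical helper: splits on runs of whitespace and returns only words.
    Function SplitOnWhitespace(source As String) As String()
        ' Trim first: Regex.Split yields an empty first (or last) element when
        ' the input starts (or ends) with a match of the pattern.
        Dim trimmed As String = source.Trim()

        ' Splitting an empty string would return one empty element, so return
        ' an empty array instead.
        If trimmed.Length = 0 Then Return New String() {}

        Return Regex.Split(trimmed, "\s+")
    End Function

    Sub Main()
        Dim source As String = "  one" & vbTab & vbTab & "two" & vbCrLf & "three "

        For Each word As String In SplitOnWhitespace(source)
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

With the sample input this prints `one`, `two` and `three`. Without the `Trim()` call, the result would also contain an empty string at each end.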
Rubens Farias
6

If you want to avoid regex, you can do it like this:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit"
    .Split()
    .Where(x => x != string.Empty)

Visual Basic equivalent:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit" _
    .Split() _
    .Where(Function(X$) X <> String.Empty)

The Where() is important since, if your string has multiple white space characters next to each other, it removes the empty strings that will result from the Split().

At the time of writing, the currently accepted answer (https://stackoverflow.com/a/1563000/49241) does not take this into account.

Adam Ralph
  • 3
    great solution. Not only does it avoid the need for a Regex reference but it's also quicker (see my post below). I'd like to add that I don't think VB makes use of the lambda operator "=>", so the VB version of this is a little different, I think like this: s.Split().Where(Function(x) x <> String.Empty) – u8it Sep 10 '15 at 19:54
  • Hey @u8it, I have added a VB .NET version to this answer. I just read your comment a few days after editing the answer!!! – Sreenikethan I May 13 '19 at 18:29
  • @Sree your edit is incorrect. The Visual Basic version is _not_ the equivalent of the C# version because it uses `String.IsNullOrWhiteSpace()` instead of the `!=` operator to compare with `String.Empty`. Can you please fix it? I don't know what the Visual Basic syntax for that is. – Adam Ralph May 14 '19 at 19:10
  • @Adam I have put a `Not`, as in `Not String.IsNullOrWhiteSpace(X)`… the Not operator negates a Boolean value. Is that what you were talking about? – Sreenikethan I May 15 '19 at 20:18
  • I've tried my edit with a sample string as well (before posting the edit), and it worked perfectly as asked by OP. Am I missing anything you said? – Sreenikethan I May 16 '19 at 02:19
  • @Sree you labelled your edit as a "Visual Basic equivalent". `Not String.IsNullOrWhiteSpace(X)` is not the Visual Basic equivalent of `x != string.Empty`. – Adam Ralph May 16 '19 at 21:19
  • Hmm, I see. At this point, they both basically do the same thing, but thank you for the clarification. I'll update it. – Sreenikethan I May 17 '19 at 04:21
  • @Sree I've made the appropriate edit. I tested the VB version in LINQPad to make sure it works. – Adam Ralph May 18 '19 at 10:36
2

So, after seeing Adam Ralph's post, I suspected his solution of being faster than the Regex solution. Just thought I'd share the results of my testing since I did find it was faster.


There are really two factors at play (ignoring system variables): number of sub-strings extracted (determined by number of delimiters), and total string length. The very simple scenario plotted below uses "A" as the sub-string delimited by two white space characters (a space followed by tab). This accentuates the effect of number of sub-strings extracted. I went ahead and did some multiple variable testing to arrive at the following general equations for my operating system.

Regex()
t = (28.33*SSL + 572)(SSN/10^6)

Split().Where()
t = (6.23*SSL + 250)(SSN/10^6)

Where t is execution time in milliseconds, SSL is average sub-string length, and SSN is number of sub-strings delimited in string.

These equations can also be written as

t = (28.33*SL + 572*SSN)/10^6

and

t = (6.23*SL + 250*SSN)/10^6

where SL is total string length (SL = SSL * SSN)

Conclusion: The Split().Where() solution is faster than Regex(). The major factor is the number of sub-strings, while string length plays a minor role. Comparing coefficients, the gain is about 2x on the per-sub-string term (572 vs. 250) and about 5x on the string-length term (28.33 vs. 6.23).
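
As a worked example, the two fitted equations can be evaluated for a concrete input. This is only a sketch: the coefficients are copied from the equations in this answer and reflect the answer author's machine, while the function names and the chosen input (one million sub-strings of length 1) are hypothetical.

```vbnet
Imports System

Module TimingModel
    ' Coefficients copied from the fitted equations in this answer; they were
    ' measured on the answer author's machine and will differ elsewhere.
    ' averageLength is SSL, subStringCount is SSN; the result is in milliseconds.
    Function RegexMs(averageLength As Double, subStringCount As Double) As Double
        Return (28.33 * averageLength + 572) * (subStringCount / 1000000.0)
    End Function

    Function SplitWhereMs(averageLength As Double, subStringCount As Double) As Double
        Return (6.23 * averageLength + 250) * (subStringCount / 1000000.0)
    End Function

    Sub Main()
        ' Hypothetical input: one million sub-strings of length 1.
        Console.WriteLine("Regex:         {0:F2} ms", RegexMs(1, 1000000))
        Console.WriteLine("Split().Where: {0:F2} ms", SplitWhereMs(1, 1000000))
    End Sub
End Module
```

For this input the model gives 28.33 + 572 = 600.33 ms for Regex and 6.23 + 250 = 256.23 ms for Split().Where(), a little over 2x, because the per-sub-string term dominates when the sub-strings are short.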


[Image: plot of the test results]


Here's my testing code (probably far more material than necessary, but it's set up to collect the multi-variable data I talked about):

using System;
using System.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.Windows.Forms;
namespace ConsoleApplication1
{
    class Program
    {
        public enum TestMethods {regex, split};
        [STAThread]
        static void Main(string[] args)
        {
            //Compare TestMethod execution times and output result information
            //to the console at runtime and to the clipboard at program finish (so that data is ready to paste into analysis environment)
            #region Config_Variables
            //Choose test method from TestMethods enumerator (regex or split)
            TestMethods TestMethod = TestMethods.split;
            //Configure RepetitionString
            String RepetitionString =  string.Join(" \t", Enumerable.Repeat("A",100));
            //Configure initial and maximum count of string repetitions (final count may not equal max)
            int RepCountInitial = 100;
            int RepCountMax = 1000 * 100;

            //Step increment to next RepCount (calculated as 20% increase from current value)
            Func<int, int> Step = x => (int)Math.Round(x / 5.0, 0);
            //Execution count used to determine average speed (calculated to adjust down to 1 execution at long execution times)
            Func<double, int> ExecutionCount = x => (int)(1 + Math.Round(500.0 / (x + 1), 0));
            #endregion

            #region NonConfig_Variables
            string s; 
            string Results = "";
            string ResultInfo; 
            double ResultTime = 1;
            #endregion

            for (int RepCount = RepCountInitial; RepCount < RepCountMax; RepCount += Step(RepCount))
            {
                s = string.Join("", Enumerable.Repeat(RepetitionString, RepCount));
                ResultTime = Test(s, ExecutionCount(ResultTime), TestMethod);
                ResultInfo = ResultTime.ToString() + "\t" + RepCount.ToString() + "\t" + ExecutionCount(ResultTime).ToString() + "\t" + TestMethod.ToString();
                Console.WriteLine(ResultInfo); 
                Results += ResultInfo + "\r\n";
            }
            Clipboard.SetText(Results);
        }
        public static double Test(string s, int iMax, TestMethods Method)
        {
            switch (Method)
            {
                case TestMethods.regex:
                    return Math.Round(RegexRunTime(s, iMax),2);
                case TestMethods.split:
                    return Math.Round(SplitRunTime(s, iMax),2);
                default:
                    return -1;
            }
        }
        private static double RegexRunTime(string s, int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
                System.Collections.Generic.IEnumerable<string> ens = Regex.Split(s, @"\s+");
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }
        private static double SplitRunTime(string s,int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
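                // Note: Where() is deferred. ens is never enumerated here, so the filter itself does not run inside the timed loop.
                System.Collections.Generic.IEnumerable<string> ens = s.Split().Where(x => x != string.Empty);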
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }
    }
}
u8it
  • 1
    Nice amount of effort but both solutions are suboptimal. Just use `str.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)` instead of filtering out the empty strings from the result. – György Kőszeg May 17 '20 at 11:00
  • That looks like a good option too. Did you compare it? I'm wondering what the performance comes down to, for instance, compiler optimization, maybe it's very similar in the IL. – u8it May 18 '20 at 00:08
  • 1
    Post-filtering by a `WhereIterator` definitely an additional cost. I created a quick [performance test](https://dotnetfiddle.net/cDB9bg). – György Kőszeg May 18 '20 at 10:02
2

String.Split() splits on every single whitespace character, so the result will usually contain empty strings whenever several whitespace characters occur in a row. The Regex solution Rubens Farias has given is the correct way to do it. I have upvoted his answer, but I want to add a small note dissecting the regex:

\s is a character class that matches all whitespace characters.

In order to split the string correctly when it contains multiple whitespace characters between words, we need to add a quantifier (or repetition operator) to the specification to match all whitespace between words. The correct quantifier to use in this case is +, meaning "one or more" occurrences of a given specification. While the syntax "\s+" is sufficient here, I prefer the more explicit "[\s]+".
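
The effect of the quantifier can be checked with a small sketch. The helper name `PieceCount` and the sample string are made up for illustration. It counts how many pieces `Regex.Split` returns for each pattern.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module QuantifierDemo
    ' Hypothetical helper: number of pieces Regex.Split yields for a pattern.
    Function PieceCount(source As String, pattern As String) As Integer
        Return Regex.Split(source, pattern).Length
    End Function

    Sub Main()
        ' "A" and "B" separated by a space followed by a tab.
        Dim source As String = "A " & vbTab & "B"

        Console.WriteLine(PieceCount(source, "\s"))     ' 3: "A", "", "B"
        Console.WriteLine(PieceCount(source, "\s+"))    ' 2: "A", "B"
        Console.WriteLine(PieceCount(source, "[\s]+"))  ' 2: same as "\s+"
    End Sub
End Module
```

Without the quantifier, every whitespace character is its own separator, so the gap between the space and the tab becomes an empty element.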

Johannes Rudolph
1

I used the solution noted by Adam Ralph, plus the VB.NET comment below by P57, with one odd exception: I had to add .ToList.ToArray on the end.

Like so:

.Split().Where(Function(x) x <> String.Empty).ToList.ToArray

Without that, I kept getting "Unable to cast object of type 'WhereArrayIterator`1[System.String]' to type 'System.String[]'."
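
The likely reason for the cast error is that `Where()` returns a lazily evaluated `IEnumerable(Of String)` rather than an array, so the result has to be materialized before it can be stored in a `String()` variable. A minimal sketch follows. The helper name `NonEmptyWords` is hypothetical, and a single `ToArray()` call is assumed to be sufficient, as the first comment on this answer reports.

```vbnet
Imports System
Imports System.Linq
Imports Microsoft.VisualBasic

Module ToArrayDemo
    ' Hypothetical helper: whitespace split with the empty entries filtered out.
    Function NonEmptyWords(source As String) As String()
        ' Where() returns a lazy IEnumerable(Of String), not an array.
        ' ToArray() runs the filter and copies the results into a String().
        Return source.Split().Where(Function(x) x <> String.Empty).ToArray()
    End Function

    Sub Main()
        Dim words As String() = NonEmptyWords("alpha  beta" & vbTab & "gamma")

        For Each word As String In words
            Console.WriteLine(word)
        Next
    End Sub
End Module
```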

Maculin
  • I was able to make this work fine with only: .Split().Where(Function(x) x <> String.Empty).ToArray – Taegost Aug 24 '16 at 15:12
  • You're welcome. I guess I should have also said at that time that it was using VS2013 and .Net 4.5.2, just in case it was a recent change. – Taegost Feb 21 '17 at 18:52
-1
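' vbNewLine is the two-character string CR+LF and CChar() keeps only its first
' character, so list CR and LF separately and drop the empty entries between them.
Dim words As String = "This is a list of words, with: a bit of punctuation" & _
                          vbTab & "and a tab character." & vbNewLine
Dim split As String() = words.Split(New Char() {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}, _
                                    StringSplitOptions.RemoveEmptyEntries)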
Ed S.