I would like to use the .NET Regex.Split method to split this input string into an array. It must split on whitespace, except where the whitespace is enclosed in quotes.

Input: Here is "my string"    it has "six  matches"

Expected output:

  1. Here
  2. is
  3. my string
  4. it
  5. has
  6. six  matches

What pattern do I need? Also do I need to specify any RegexOptions?

Chad Birch
Shaun Bowe

12 Answers

67

No options required

Regex:

\w+|"[\w\s]*"

C#:

Regex regex = new Regex(@"\w+|""[\w\s]*""");

Or if you need to exclude " characters:

    Regex
        .Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
        .Cast<Match>()
        .Select(m => m.Groups["match"].Value)
        .ToList()
        .ForEach(s => Console.WriteLine(s));
Bartek Szabat
  • VERY CLOSE! Now all I need is to preserve the whitespace in the matches. – Shaun Bowe Feb 16 '09 at 18:18
  • If anyone's interested, this is a modified version of Bartek's regex which works for non-word characters (e.g. full stops, commas and brackets): [^\s"]+|"[^"]*" – Joel Rein Nov 03 '11 at 06:22
  • @Jivlain Very useful, thanks! It indeed matches words like `(test`, only `("test test"` has two words: `(` and `"test test"`. Is there any way to fix this? Thanks! – Luc Oct 15 '12 at 14:11
  • I got it: `([^\s]*"[^"]+"[^\s]*)|[^"]?\w+[^"]?` Now the only problem is that it doesn't work in JavaScript :/ But that's off-topic here. – Luc Oct 15 '12 at 14:45
  • What if the string can contain quotes, like "something "" some other thing"? – Arsen Mkrtchyan Aug 29 '13 at 11:42
  • A modified version, which will allow both single and double quote delimiters: `@"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'"` – StuartN Mar 09 '18 at 11:35
  • @StuarN How would I use your regex and modify it so that hyphens are part of the word, and not treated as whitespace? – Y Haber Nov 12 '19 at 01:33
  • I ended up using this: ```var pattern = @"(?(?<="")[^\s]([^""]+)(?=""))|(?(?<=')[^\s]([^']+(?=')))|(?[A-Za-z0-9\-]+)";``` – Y Haber Nov 12 '19 at 19:44
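
For reference, the pattern from Joel Rein's comment can be passed to Regex.Matches much like the answer's own pattern. The following is only a minimal sketch and not code from this thread: the class and method names (QuoteAwareSplitter, Tokenize) are made up for illustration, and it assumes that stripping the enclosing quotes with Trim('"') is acceptable. Whitespace inside the quotes is kept as-is.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class QuoteAwareSplitter
{
    // Either a run of characters that are neither whitespace nor double quotes,
    // or a complete double-quoted section.
    private static readonly Regex TokenPattern = new Regex(@"[^\s""]+|""[^""]*""");

    public static List<string> Tokenize(string input)
    {
        return TokenPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value.Trim('"')) // drop the enclosing quotes, keep inner whitespace
            .ToList();
    }
}
```

Calling QuoteAwareSplitter.Tokenize on the input from the question should give the six expected tokens, with the double space in "six  matches" preserved.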
18

Lieven's solution gets most of the way there, and as he states in his comments it's just a matter of changing the ending to Bartek's solution. The end result is the following working regEx:

(?<=")\w[\w\s]*(?=")|\w+|"[\w\s]*"

Input: Here is "my string" it has "six matches"

Output:

  1. Here
  2. is
  3. "my string"
  4. it
  5. has
  6. "six matches"

Unfortunately it's including the quotes. If you instead use the following:

(("((?<token>.*?)(?<!\\)")|(?<token>[\w]+))(\s)*)

And explicitly capture the "token" matches as follows:

    RegexOptions options = RegexOptions.None;
    Regex regex = new Regex( @"((""((?<token>.*?)(?<!\\)"")|(?<token>[\w]+))(\s)*)", options );
    string input = @"   Here is ""my string"" it has   "" six  matches""   ";
    var result = (from Match m in regex.Matches( input ) 
                  where m.Groups[ "token" ].Success
                  select m.Groups[ "token" ].Value).ToList();

    for ( int i = 0; i < result.Count(); i++ )
    {
        Debug.WriteLine( string.Format( "Token[{0}]: '{1}'", i, result[ i ] ) );
    }

Debug output:

Token[0]: 'Here'
Token[1]: 'is'
Token[2]: 'my string'
Token[3]: 'it'
Token[4]: 'has'
Token[5]: ' six  matches'
Timothy Walters
  • I need a regexp for the JavaScript split() function that splits words on whitespace except for those in quotes. I couldn't use the one you wrote; do you know how to write one in JavaScript? – ajsie Dec 29 '09 at 07:05
  • To change this so that it counts other symbols as words just change the [\w] to match. It was splitting on decimal points so I changed it to [\w.] and now it splits properly. – Sean Dawson Jan 25 '12 at 06:33
  • So what does the regex look like when it should not split on colons (:)? – Boas Enkler Jul 28 '12 at 23:00
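
The (?<!\\) lookbehind in this answer's pattern is what lets a quoted section contain backslash-escaped quotes, and Sean Dawson's comment shows how to widen the word class. Here is a rough sketch combining the two. It is a simplified variant of the pattern (the outer groups and the trailing (\s)* are dropped), and the names EscapedQuoteTokenizer and Tokenize are hypothetical. Note that the backslashes are left in the token; nothing is unescaped.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class EscapedQuoteTokenizer
{
    // First alternative: a quoted section that ends at the first quote
    // not preceded by a backslash. Second alternative: a run of word
    // characters or dots (the [\w.] tweak from the comments).
    private static readonly Regex TokenPattern =
        new Regex(@"""(?<token>.*?)(?<!\\)""|(?<token>[\w.]+)");

    public static List<string> Tokenize(string input)
    {
        return TokenPattern.Matches(input)
            .Cast<Match>()
            .Where(m => m.Groups["token"].Success)
            .Select(m => m.Groups["token"].Value)
            .ToList();
    }
}
```

With the input say "a \"quoted\" word" v1.2 this should yield three tokens: say, then a \"quoted\" word (backslashes included), then v1.2.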
10

The top answer doesn't quite work for me. I was trying to split this sort of string by spaces, but it looks like it splits on the dots ('.') as well.

"the lib.lib" "another lib".lib

I know the question asks about regexes, but I ended up writing a non-regex function to do this:

    /// <summary>
    /// Splits the string passed in by the delimiters passed in.
    /// Quoted sections are not split, and all tokens have whitespace
    /// trimmed from the start and end.
    /// </summary>
    public static List<string> split(string stringToSplit, params char[] delimiters)
    {
        List<string> results = new List<string>();

        bool inQuote = false;
        StringBuilder currentToken = new StringBuilder();
        for (int index = 0; index < stringToSplit.Length; ++index)
        {
            char currentCharacter = stringToSplit[index];
            if (currentCharacter == '"')
            {
                // When we see a ", we need to decide whether we are
                // at the start or end of a quoted section...
                inQuote = !inQuote;
            }
            else if (delimiters.Contains(currentCharacter) && inQuote == false)
            {
                // We've come to the end of a token, so we find the token,
                // trim it and add it to the collection of results...
                string result = currentToken.ToString().Trim();
                if (result != "") results.Add(result);

                // We start a new token...
                currentToken = new StringBuilder();
            }
            else
            {
                // We've got a 'normal' character, so we add it to
                // the current token...
                currentToken.Append(currentCharacter);
            }
        }

        // We've come to the end of the string, so we add the last token...
        string lastResult = currentToken.ToString().Trim();
        if (lastResult != "") results.Add(lastResult);

        return results;
    }
Richard Shepherd
  • I hope this answer isn't deemed off topic as it's a non-regex function. I found this question while looking for the more general topic of how to split a string while preserving quotes, rather than the more specific question about regexes. – Richard Shepherd Dec 19 '11 at 22:02
  • This is a lot clearer than figuring out a particular C#-flavored regex solution. – timc Oct 09 '13 at 01:50
  • This is what I wanted! Worked awesome! – Merin Nakarmi Oct 17 '18 at 16:46
8

I was using Bartek Szabat's answer, but I needed to capture more than just "\w" characters in my tokens. To solve the problem, I modified his regex slightly, similar to Grzenio's answer:

Regular Expression: (?<match>[^\s"]+)|(?<match>"[^"]*")

C# String:          (?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")

Bartek's code (which returns tokens stripped of enclosing quotes) becomes:

Regex
        .Matches(input, "(?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")")
        .Cast<Match>()
        .Select(m => m.Groups["match"].Value)
        .ToList()
        .ForEach(s => Console.WriteLine(s));
Boinst
6

I have found the regex in this answer to be quite useful. To make it work in C# you will have to use the MatchCollection class.

//need to escape \s
string pattern = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'";

MatchCollection parsedStrings = Regex.Matches(line, pattern);

for (int i = 0; i < parsedStrings.Count; i++)
{
    //print parsed strings
    Console.Write(parsedStrings[i].Value + " ");
}
Console.WriteLine();
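
Printing Value as above keeps the enclosing quotes. If the quotes should be dropped, the two capture groups in the pattern can be read instead. This is a sketch under that assumption only; the names QuoteGroupTokenizer and Tokenize are invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class QuoteGroupTokenizer
{
    // Group 1 captures the inside of a double-quoted section,
    // group 2 the inside of a single-quoted section.
    private static readonly Regex TokenPattern =
        new Regex("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");

    public static List<string> Tokenize(string line)
    {
        return TokenPattern.Matches(line)
            .Cast<Match>()
            .Select(m => m.Groups[1].Success ? m.Groups[1].Value
                       : m.Groups[2].Success ? m.Groups[2].Value
                       : m.Value) // unquoted token: use the whole match
            .ToList();
    }
}
```

For example, the line run 'single quoted' "double quoted" plain should come back as four tokens without any quote characters.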
Syed Ali
4

This regex will split based on the case you have given above, although it does not strip the quotes or extra spaces, so you may want to do some post processing on your strings. This should correctly keep quoted strings together though.

"[^"]+"|\s?\w+?\s
John Conrad
  • Thanks for the answer. This is very close. Close enough that I will use it for now. I will leave the question open for a day or so to see if there is a more complete answer. Otherwise I will accept this. – Shaun Bowe Feb 16 '09 at 18:17
  • "([^"]+)"|\s?(\w+?)\s will return "-stripped strings – f3lix Feb 16 '09 at 18:36
2

With a little bit of messiness, regular languages can keep track of even/odd counting of quotes, but if your data can include escaped quotes (\") then you're in real trouble producing or comprehending a regular expression that will handle that correctly.
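
If escaped quotes are a concern, one alternative is to skip the regex and scan the string by hand, in the spirit of the loop-based answer above. The sketch below is an illustration only, with made-up names (EscapeAwareSplitter, Split); it assumes that \" always means a literal quote and that any other backslash is an ordinary character.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class EscapeAwareSplitter
{
    // Splits on whitespace outside double quotes. A backslash followed by
    // a quote is emitted as a literal quote and does not toggle quoting.
    public static List<string> Split(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuote = false;
        bool hasToken = false; // true once a token has started, so "" yields an empty token

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '\\' && i + 1 < input.Length && input[i + 1] == '"')
            {
                current.Append('"');
                hasToken = true;
                i++; // skip the escaped quote
            }
            else if (c == '"')
            {
                inQuote = !inQuote; // opening or closing quote, not part of the token
                hasToken = true;
            }
            else if (char.IsWhiteSpace(c) && !inQuote)
            {
                if (hasToken) tokens.Add(current.ToString());
                current.Clear();
                hasToken = false;
            }
            else
            {
                current.Append(c);
                hasToken = true;
            }
        }

        if (hasToken) tokens.Add(current.ToString());
        return tokens;
    }
}
```

Unlike a lookbehind-based regex, this version also unescapes: say "a \"quoted\" word" end should split into say, a "quoted" word, and end.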

Liudvikas Bukys
1

EDIT: Sorry for my previous post, this is obviously possible.

To handle all the non-alphanumeric characters you need something like this:

MatchCollection matchCollection = Regex.Matches(input, @"(?<match>[^""\s]+)|\""(?<match>[^""]*)""");
foreach (Match match in matchCollection)
{
    yield return match.Groups["match"].Value;
}

You can make the foreach smarter (for example, with LINQ) if you are using .NET > 2.0.

Grzenio
1

Shaun,

I believe the following regex should do it

(?<=")\w[\w\s]*(?=")|\w+  

Regards,
Lieven

Lieven Keersmaekers
0

If you'd like to take a look at a general solution to this problem in the form of a free, open-source javascript object, you can visit http://splitterjsobj.sourceforge.net/ for a live demo (and download). The object has the following features:

  • Pairs of user-defined quote characters can be used to escape the delimiter (prevent a split inside quotes). The quotes can be escaped with a user-defined escape char, and/or by "double quote escape." The escape char can be escaped (with itself). In one of the 5 output arrays (properties of the object), output is unescaped. (For example, if the escape char = /, "a///"b" is unescaped as a/"b)
  • Split on an array of delimiters; parse a file in one call. (The output arrays will be nested.)
  • All escape sequences recognized by javascript can be evaluated during the split process and/or in a preprocess.
  • Callback functionality
  • Cross-browser consistency

The object is also available as a jQuery plugin, but as a new user at this site I can only include one link in this message.

Brian W
  • Wait, what? The OP is asking about .NET regexp. Is this a commercial for your lib, or was there a way you thought it'd integrate into .NET easily? – ruffin Jul 01 '12 at 00:20
0

Take a look at LSteinle's "Split Function that Supports Text Qualifiers" over at CodeProject.

Here is the snippet from his project that you’re interested in.

using System.Text.RegularExpressions;

public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    string _Statement = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))", 
                        Regex.Escape(delimiter), Regex.Escape(qualifier));

    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;

    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}

Just watch out for calling this in a loop, as it's creating and compiling the Regex every time you call it. So if you need to call it more than a handful of times, I would look at creating a Regex cache of some kind.
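
One possible shape for such a cache is a ConcurrentDictionary keyed on the three inputs that determine the pattern and options. This is a sketch, not part of LSteinle's project: the class name QualifiedSplitter and the cache layout are assumptions, and the cache is never trimmed, so it only suits a small set of distinct delimiter/qualifier pairs.

```csharp
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical wrapper, named for illustration only.
public static class QualifiedSplitter
{
    // One compiled Regex per (delimiter, qualifier, ignoreCase) combination.
    private static readonly ConcurrentDictionary<(string Delimiter, string Qualifier, bool IgnoreCase), Regex> Cache =
        new ConcurrentDictionary<(string, string, bool), Regex>();

    public static string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
    {
        Regex regex = Cache.GetOrAdd((delimiter, qualifier, ignoreCase), key =>
        {
            // Same pattern as the snippet above: a delimiter that is followed
            // by an even number of qualifier characters.
            string pattern = string.Format(
                "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
                Regex.Escape(key.Delimiter), Regex.Escape(key.Qualifier));

            RegexOptions options = RegexOptions.Compiled | RegexOptions.Multiline;
            if (key.IgnoreCase) options |= RegexOptions.IgnoreCase;
            return new Regex(pattern, options);
        });

        return regex.Split(expression);
    }
}
```

Repeated calls with the same delimiter and qualifier then reuse the compiled Regex. As with the original snippet, the qualifiers stay in the output and consecutive delimiters produce empty entries.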

Adam Larsen
0

I need to support nesting so none of these worked for me. I gave up trying to do it via Regex and just coded:

  public static Argument[] ParseCmdLine(string args) {
    List<string> ls = new List<string>();
    StringBuilder sb = new StringBuilder(128);

    // support quoted text nesting up to 8 levels deep
    // (slot 0 is an unused sentinel, so 9 slots are needed for 8 levels)
    Span<char> quoteChar = stackalloc char[9];
    int quoteLevel = 0;
      
    for (int i = 0; i < args.Length; ++i) {
      char ch = args[i];
      switch (ch) {
        case ' ':
          if (quoteLevel == 0) {
            ls.Add(sb.ToString());
            sb.Clear();
            break;
          } 
          goto default; 
        case '"':
        case '\'':
          if (quoteChar[quoteLevel] == ch) {
            --quoteLevel;
          } else {
            quoteChar[++quoteLevel] = ch;
          }
          goto default; 
        default:
          sb.Append(ch);
          break;
      }
    }
    if (sb.Length > 0) { ls.Add(sb.ToString()); sb.Clear(); }

    return Arguments.ParseCmdLine(ls.ToArray());
  }

And here's some additional code to parse the command line arguments to objects:

  public struct Argument {
    public string Prefix;
    public string Name;
    public string Eq;
    public string QuoteType;
    public string Value;

    public string[] ToArray() => this.Eq == " " ? new string[] { $"{Prefix}{Name}", $"{QuoteType}{Value}{QuoteType}" } : new string[] { this.ToString() };
    public override string ToString() => $"{Prefix}{Name}{Eq}{QuoteType}{Value}{QuoteType}";
  }

  private static readonly Regex RGX_MatchArg = new Regex(@"^(?<prefix>-{1,2}|\/)(?<name>[a-zA-Z][a-zA-Z_-]*)(?<assignment>(?<eq>[:= ]|$)(?<quote>[""'])?(?<value>.+?)(?:\k<quote>|\s*$))?");
  private static readonly Regex RGX_MatchQuoted = new Regex(@"(?<quote>[""'])?(?<value>.+?)(?:\k<quote>|\s*$)");

  public static Argument[] ParseCmdLine(string[] rawArgs) {
    int count = 0;
    Argument[] pairs = new Argument[rawArgs.Length];

    int i = 0;
    while(i < rawArgs.Length) {
      string current = rawArgs[i];
      i+=1;
      Match matches = RGX_MatchArg.Match(current);
      Argument arg = new Argument();
      arg.Prefix = matches.Groups["prefix"].Value;
      arg.Name = matches.Groups["name"].Value;
      arg.Value = matches.Groups["value"].Value;
      if(!string.IsNullOrEmpty(arg.Value)) {
        arg.Eq = matches.Groups["eq"].Value;
        arg.QuoteType = matches.Groups["quote"].Value;
      } else if ((i < rawArgs.Length) && !rawArgs[i].StartsWith('-') && !rawArgs[i].StartsWith('/')) {
        arg.Eq = " ";
        Match quoted = RGX_MatchQuoted.Match(rawArgs[i]);
        arg.QuoteType = quoted.Groups["quote"].Value;
        arg.Value = quoted.Groups["value"].Value;
        i+=1;
      }
      if(string.IsNullOrEmpty(arg.QuoteType) && arg.Value.IndexOfAny(new char[] { ' ', '/', '\\', '-', '=', ':' }) >= 0) {
        arg.QuoteType = "\"";
      }
      pairs[count++] = arg;
    }

    return pairs[0..count];
  }

  public static ILookup<string, Argument> ToLookup(this Argument[] args) => args.ToLookup((arg) => arg.Name, StringComparer.OrdinalIgnoreCase);
}

It's able to parse all different kinds of argument variants:

-test -environment staging /DEqTest=avalue /Dcolontest:anothervalue /DwithSpaces="heys: guys" /slashargflag -action="Do: 'The Thing'" -action2 "do: 'Do: \"The Thing\"'" -init

Nested quotes just need to be alternated between different quote types.

Derek Ziemba