62

To make things simple:

string streamR = sr.ReadLine();  // sr.Readline results in:
                                 //                         one "two two"

I want to be able to save them as two different strings, remove all spaces EXCEPT for the spaces found between quotation marks. Therefore, what I need is:

string 1 = one
string 2 = two two

So far what I have found that works is the following code, but it removes the spaces within the quotes.

// streamR only contains two strings
string[] splitter = streamR.Split(' ');
str1 = splitter[0];
// Only set str2 if the length is > 1
str2 = splitter.Length > 1 ? splitter[1] : string.Empty;

The output of this becomes

one
two

I have looked into Regular Expression to split on spaces unless in quotes, but I can't get the regex to work or understand the code, especially how to split the input into two separate strings. All the code samples there give me a compile error (I am using System.Text.RegularExpressions).

Olivier Jacot-Descombes
Teachme
  • It will probably be easier to write your own parser for this - regex is not suitable for this kind of logic. – Oded Feb 01 '13 at 21:07
  • What compiling error? What is the error message? On what line? – O. R. Mapper Feb 01 '13 at 21:09
  • Error 1 Could not find an implementation of the query pattern for source type 'System.Text.RegularExpressions.MatchCollection'. 'Cast' not found. Are you missing a reference to 'System.Core.dll' or a using directive for 'System.Linq'? – Teachme Feb 01 '13 at 21:12
  • the string.split executes what I want beautifully, besides the fact of my quotation problem – Teachme Feb 01 '13 at 21:16

8 Answers

60
string input = "one \"two two\" three \"four four\" five six";
var parts = Regex.Matches(input, @"[\""].+?[\""]|[^ ]+")
                .Cast<Match>()
                .Select(m => m.Value)
                .ToList();
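
The pattern has two alternatives that are tried in order: a quote followed by as few characters as possible up to the next quote, or else a run of non-space characters. Here is a minimal sketch of the same idea that also drops the enclosing quotes by reading a capture group. The names `QuotedSplit` and `Tokenize` are made up for illustration, and the capture group is an addition to the pattern above. Note the `using System.Linq;` line, which `Cast<Match>()` needs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;                      // required for Cast<T>() and Select()
using System.Text.RegularExpressions;

public static class QuotedSplit
{
    // Hypothetical helper: returns the tokens with enclosing quotes removed.
    public static List<string> Tokenize(string input)
    {
        // Alternative 1: "..." (lazy, stops at the next quote); the content lands in group 1.
        // Alternative 2: a run of non-space characters.
        return Regex.Matches(input, "\"(.+?)\"|[^ ]+")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Success ? m.Groups[1].Value : m.Value)
                    .ToList();
    }
}
```

For the question's input, `QuotedSplit.Tokenize("one \"two two\"")` should give the two strings `one` and `two two`, which can then be read as `result[0]` and `result[1]`.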
I4V
  • Am I missing a using? Error 1 'System.Text.RegularExpressions.MatchCollection' does not contain a definition for 'Cast' and no extension method 'Cast' accepting a first argument of type 'System.Text.RegularExpressions.MatchCollection' could be found – Teachme Feb 01 '13 at 21:59
  • @Teachme It requires `System.Linq` – I4V Feb 01 '13 at 22:05
  • this works, but it ignores the fact that white spaces within quotes class it as one token rather than two or more – Teachme Feb 01 '13 at 22:19
  • @Teachme I don't understand what you mean. An example maybe? – I4V Feb 01 '13 at 22:23
  • Erm, basically if `string input ="Apple \"iPhone four\"` your code result becomes "Apple", "iPhone", "four". I am trying to get "Apple", "iPhone four" in two different strings so i can store them – Teachme Feb 01 '13 at 22:36
  • @Teachme My code returns 2 strings, and sorry I still don't understand what you expect. (BTW your example string should end with `"`). – I4V Feb 01 '13 at 22:45
  • Sorry, i forgot to add the "! I am uploading a picture for another answer i will reply here too – Teachme Feb 01 '13 at 22:47
  • @Teachme, My code works, and `string input = "OneWord \"Two words\"";` returns 2 strings as expected. Sorry but, I don't want to continue this meaningless conversation. – I4V Feb 01 '13 at 23:18
  • Sorry, it is my fault, i did not implement it right. Your answer did help though. – Teachme Feb 01 '13 at 23:29
  • fails for command arg splitting: `test --file="some file.txt"` splits into 3 strings, not two. Expected output would be: `test` and `--file="some file.txt"` I'm no regex guru so can't fix it. :( – Mark May 06 '14 at 18:56
  • Found this regex which works: `Regex.Split(ConsoleInput, "(?<=^[^\"]*(?:\"[^\"]*\"[^\"]*)*) (?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");` at http://stackoverflow.com/a/4780801/953414 with an explanation of how it works. – Mark May 06 '14 at 19:22
  • How to split it with dots, question marks, exclamation marks etc. instead of spaces. I'm trying to get every sentence one by one except the ones inside of quotes. For example: Walked. Turned back. But why? And said "Hello world. Damn this string splitting things!" without a shame. – ErTR Jan 26 '16 at 00:33
  • @ErtürkÖztürk Sounds like you want to split by non-word characters instead of spaces, so replace the space in the regex by \W, which means any non-word character. – Timo Jun 17 '16 at 07:50
  • Please explain your answer, just giving a snippet of code isn't teaching anything, and this user will likely have to ask more questions in the future about the subject, at the very least, provide a reference for learning. – Jordan LaPrise Nov 08 '16 at 18:35
  • This chops the string up correctly, but unfortunately this still leaves the \" in the output. – John Stock May 01 '17 at 08:15
  • I made two changes others might find useful: `@"(['\""])(?<value>.+?)\1|(?<value>[^ ]+)"` and `Select(m=>m.Groups["value"].Value)`. The pattern change will let you add any delimiter to look for, but then it must match the closing delimiter (I use " and ' in the pattern above). It will also place the searched-for value into the `value` group so we fetch that value instead. This gets rid of the delimiters from the strings automatically, since `.Trim()`ming would remove all the matching delimiter characters instead of just the actual delimiters. – James Mar 16 '18 at 00:56
  • if anyone wants to exclude double speech marks substitute this with above `@"(?<=[ ][\""]|^[\""])[^\""]+(?=[\""][ ]|[\""]$)|(?<=[ ]|^)[^\"" ]+(?=[ ]|$)"` – uosjead Dec 16 '19 at 11:02
45

You can even do that without Regex: a LINQ expression with String.Split can do the job.

You can first split your string by ", then split only the elements with an even index in the resulting array by a space.

var result = myString.Split('"')
                     .Select((element, index) => index % 2 == 0  // If even index
                                           ? element.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)  // Split the item
                                           : new string[] { element })  // Keep the entire item
                     .SelectMany(element => element).ToList();

For the string:

This is a test for "Splitting a string" that has white spaces, unless they are "enclosed within quotes"

It gives the result:

This
is
a
test
for
Splitting a string
that
has
white
spaces,
unless
they
are
enclosed within quotes

UPDATE

string myString = "WordOne \"Word Two\"";
var result = myString.Split('"')
                     .Select((element, index) => index % 2 == 0  // If even index
                                           ? element.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)  // Split the item
                                           : new string[] { element })  // Keep the entire item
                     .SelectMany(element => element).ToList();

Console.WriteLine(result[0]);
Console.WriteLine(result[1]);
Console.ReadKey();

UPDATE 2

How do you define a quoted portion of the string?

We will assume that the string before the first " is non-quoted.

Then, the string placed between the first " and before the second " is quoted. The string between the second " and the third " is non-quoted. The string between the third and the fourth is quoted, ...

The general rule is: Each string between the (2*n-1)th (odd number) " and (2*n)th (even number) " is quoted. (1)

What is the relation with String.Split?

String.Split with the default StringSplitOptions (StringSplitOptions.None) starts with a list of one string and then adds a new string to the list for each splitting character found.

So, the string before the first " is at index 0 in the split array, the string between the first and second " is at index 1, the string between the second and third " is at index 2, ...

The general rule is: The string between the nth and (n+1)th " is at index n in the array. (2)

Given (1) and (2), we can conclude that quoted portions are at odd indices in the split array.
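
To see the odd/even rule in action, here is a small sketch that labels each piece returned by Split('"'). The names `SplitParity` and `Describe` are made up for this illustration and are not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;

public static class SplitParity
{
    // Hypothetical helper: labels every piece of input.Split('"') as quoted or plain.
    public static List<string> Describe(string input)
    {
        // Default behaviour (StringSplitOptions.None): empty pieces are kept,
        // which is what keeps the odd/even positions aligned.
        string[] pieces = input.Split('"');
        var lines = new List<string>();
        for (int i = 0; i < pieces.Length; i++)
        {
            string kind = i % 2 == 1 ? "quoted" : "plain";
            lines.Add($"{i}: {kind} [{pieces[i]}]");
        }
        return lines;
    }
}
```

When the whole input is quoted, for example `"Word Two"` including the quotes, the split yields three pieces: an empty string at index 0, `Word Two` at index 1 and an empty string at index 2, so the quoted part still sits at an odd index.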

Cédric Bignon
  • That looks good, but is there a way to separate the list into individual strings? (the readline will only have two words at a time including those in quotes) – Teachme Feb 01 '13 at 21:58
  • @Teachme You can just get _result[0]_ and _result[1]_. – Cédric Bignon Feb 01 '13 at 22:00
  • I had to move one of the brackets to remove an error but I am left with this:Error 1 A local variable named 'element' cannot be declared in this scope because it would give a different meaning to 'element', which is already used in a 'parent or current' scope to denote something else – Teachme Feb 01 '13 at 22:08
  • Righteo, it compiled all good, but it caught an exception :o my streamR.ReadLine comes from a socket stream- if i type one line i get "System.NullReferenceException: Object reference not set to an instance of an object (points to empty line but thats when this code starts). making the readline (WordOne "Word Two" just has an output of WordOne Word. Console.WriteLine(result[0]); Console.WriteLine(result[1]); <- is how i am displaying them. In fact, i may be able to modify this slightly to make it work :) – Teachme Feb 01 '13 at 22:25
  • Ah yes, silly me (long day). Give me a few minutes and I will upload a screenshot of what I mean – Teachme Feb 01 '13 at 22:43
  • @Teachme Can you try the code I've written in the _Update_ section of my post. What result do you have? – Cédric Bignon Feb 01 '13 at 22:44
  • I just realised my problem. It works as it is supposed to when the mystring has "WordOne \"Word Two\""; Meaning your code works perfect! What I probably failed to mention is the string is coming from a client-side cmd box, and this code is on the server connected through a network socketstream. The client cannot type in the back-slashes only C:\username\client>wordone "word two" That, is where it's going wrong, and it displays my problem in original post. Incase i forget to mention- thank you for your help so far – Teachme Feb 01 '13 at 22:57
  • @Teachme There is no need to escape the _"_ if the client enters it in a textbox. Escaping is only needed for the developer when writing a string in the code. – Cédric Bignon Feb 01 '13 at 22:59
  • its on windows cmd. a visual example may help [link](http://i46.tinypic.com/dxfjfp.png) see link for image. Now, on your updated code word for word i get the expected outcome. But that outcome needs to come from exactly what the client writes as shown in picture – Teachme Feb 01 '13 at 23:07
  • I am just wondering would it be due to the fact I am using "\r\n" to follow protocols causing the problem? – Teachme Feb 01 '13 at 23:21
  • What justifies the idea that the even indices are the non quoted portions? I see no reason why that ought to be true in general. – Eric Lippert Feb 02 '13 at 00:24
  • @EricLippert I've updated my post to explain why it is true in general. – Cédric Bignon Feb 02 '13 at 00:56
  • I'm still not following you. What if the entire string is quoted? Then there is only one string and it is at position zero, which is even. Why is this not an issue? – Eric Lippert Feb 02 '13 at 01:24
  • @EricLippert Can you write an example you describe? – Cédric Bignon Feb 02 '13 at 10:07
  • If the entire string is quoted, there will be three strings post-split, and strings 0 & 2 will be empty – Eamon Nerbonne Feb 02 '13 at 10:13
  • @EamonNerbonne Exactly, then, the quoted string is still at an odd index in the split array. – Cédric Bignon Feb 02 '13 at 10:16
  • yeah, I meant to reply to @EricLippert - it's not a problem, it works as expected. – Eamon Nerbonne Feb 02 '13 at 10:23
  • Well, I'm not familiar enough with LINQ expressions to understand fully what this does, and I hate using code I don't understand, but I couldn't seem to write this or an equivalent of this myself, and as Eamon so eloquently put it, "it works as expected." So, yeah, +1. – VoidKing Jul 09 '13 at 14:12
  • @CédricBignon what about the cases where we have an odd number of double quotes? In those cases I get back an empty element? To counter that I have personally manipulated the following in my code: .SelectMany(element => element.Where(x => x.Length > 0)) – Squazz Jan 05 '16 at 13:56
  • @CédricBignon I have problems with cases where I have the char we are looking for in the middle of the sentence. Say, if we are looking for the char " and the sentence is 'test of "multiple words' where I have an odd number of the char, it's still splitting the string in the following: test, of, multiple words. I had hoped that this would only hit occurrences where there was a start and an end character? – Squazz Feb 16 '16 at 08:46
16

A custom parser might be more suitable for this.

This is something I wrote once when I had a specific (and very strange) parsing requirement that involved parentheses and spaces, but it is generic enough that it should work with virtually any delimiter and text qualifier.

public static IEnumerable<String> ParseText(String line, Char delimiter, Char textQualifier)
{

    if (line == null)
        yield break;

    else
    {
        Char prevChar = '\0';
        Char nextChar = '\0';
        Char currentChar = '\0';

        Boolean inString = false;

        StringBuilder token = new StringBuilder();

        for (int i = 0; i < line.Length; i++)
        {
            currentChar = line[i];

            if (i > 0)
                prevChar = line[i - 1];
            else
                prevChar = '\0';

            if (i + 1 < line.Length)
                nextChar = line[i + 1];
            else
                nextChar = '\0';

            if (currentChar == textQualifier && (prevChar == '\0' || prevChar == delimiter) && !inString)
            {
                inString = true;
                continue;
            }

            if (currentChar == textQualifier && (nextChar == '\0' || nextChar == delimiter) && inString)
            {
                inString = false;
                continue;
            }

            if (currentChar == delimiter && !inString)
            {
                yield return token.ToString();
                token = token.Remove(0, token.Length);
                continue;
            }

            token = token.Append(currentChar);

        }

        yield return token.ToString();

    } 
}

The usage would be:

var parsedText = ParseText(streamR, ' ', '"');
psubsee2003
  • This clearly is the best solution. But it is missing an } at the end! And runs in O(n) ! – mischka Aug 31 '17 at 15:45
  • @mischka you are right. You win for finding the syntax error that was undiscovered for 4+ years – psubsee2003 Dec 12 '17 at 22:31
  • This is the best solution. One problem: for blank lines, it returns a single empty string. I fixed this by replacing the final yield with `if (string.IsNullOrWhiteSpace(token.ToString())) yield break; else yield return token.ToString();` – MattDG Feb 22 '19 at 21:32
  • @MattDG that could also be addressed in the initial null check too. That null check could be replaced with `if (string.IsNullOrWhiteSpace(line))`. It depends on the needs of the app – psubsee2003 Feb 24 '19 at 17:27
  • Agreed. It's better to check for whitespace in the initial null check. I'm using that version now -- thanks! – MattDG Feb 28 '19 at 03:16
14

You can use the TextFieldParser class that is part of the Microsoft.VisualBasic.FileIO namespace (you'll need to add a reference to Microsoft.VisualBasic to your project):

string inputString = "This is \"a test\" of the parser.";

using (MemoryStream ms = new MemoryStream(Encoding.ASCII.GetBytes(inputString)))
{
    using (Microsoft.VisualBasic.FileIO.TextFieldParser tfp = new Microsoft.VisualBasic.FileIO.TextFieldParser(ms))
    {
        tfp.Delimiters = new string[] { " " };
        tfp.HasFieldsEnclosedInQuotes = true;
        string[] output = tfp.ReadFields();

        for (int i = 0; i < output.Length; i++)
        {
            Console.WriteLine("{0}:{1}", i, output[i]);
        }
    }
}

Which generates the output:

0:This
1:is
2:a test
3:of
4:the
5:parser.
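
The parser can also read from a TextReader, so a StringReader can stand in for the MemoryStream. Here is a minimal sketch of that variant, assuming the project can reference Microsoft.VisualBasic. The wrapper names `FieldSplitter` and `SplitLine` are hypothetical.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class FieldSplitter
{
    // Hypothetical wrapper: splits one line on spaces, keeping quoted fields together.
    public static string[] SplitLine(string line)
    {
        using (var reader = new StringReader(line))
        using (var parser = new TextFieldParser(reader))
        {
            parser.Delimiters = new[] { " " };
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields can return null when there is nothing left to read,
            // so callers may want to check for that.
            return parser.ReadFields();
        }
    }
}
```

This avoids choosing an encoding for the byte conversion, which matters if the input can contain non-ASCII characters.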
DavidRR
John Koerner
  • No need to use `MemoryStream`, TextFieldParser has an overload which takes a TextReader, so you can simply pass `new StringReader(inputString)` to the constructor – Ghost4Man Jan 08 '17 at 09:32
  • The string constructor expects a path as the string, not the text to be parsed – John Koerner Jan 08 '17 at 13:55
  • I meant the overload that takes a `TextReader` (the `StringReader` subclass can be created from a string), the `TextFieldParser` reads the string from `TextReader`. Look at the msdn documentation for [TextFieldParser](https://msdn.microsoft.com/en-us/library/ms128084.aspx) and [StringReader](https://msdn.microsoft.com/en-us/library/system.io.stringreader.stringreader.aspx) constructors – Ghost4Man Jan 08 '17 at 22:10
  • This is the best option. Don't reinvent the wheel... – szamil Jul 06 '22 at 19:50
4

With support for doubled quotes ("" inside a quoted token stands for a literal ").

String:

a "b b" "c ""c"" c"

Result:

a 
"b b"
"c ""c"" c"

Code:

var list=Regex.Matches(value, @"\""(\""\""|[^\""])+\""|[^ ]+", 
    RegexOptions.ExplicitCapture)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();

Optional remove double quotes:

Select(m => m.StartsWith("\"") ? m.Substring(1, m.Length - 2).Replace("\"\"", "\"") : m)

Result

a 
b b
c "c" c
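
Putting the two snippets together, a minimal sketch could look like the following. `DoubledQuoteSplit`, `Tokenize` and `Unquote` are hypothetical names, and the extra length check in `Unquote` is an added safeguard (an assumption, not part of the code above) so that a lone `"` token does not make Substring throw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DoubledQuoteSplit
{
    // Hypothetical helper: matches tokens, then strips quotes and un-doubles "".
    public static List<string> Tokenize(string value)
    {
        // Same pattern as above: a quoted run where "" counts as content,
        // or else a run of non-space characters.
        return Regex.Matches(value, @"\""(\""\""|[^\""])+\""|[^ ]+", RegexOptions.ExplicitCapture)
                    .Cast<Match>()
                    .Select(m => Unquote(m.Value))
                    .ToList();
    }

    private static string Unquote(string token)
    {
        // Added guard: only strip when the token both starts and ends with a quote.
        bool quoted = token.Length >= 2 && token.StartsWith("\"") && token.EndsWith("\"");
        return quoted
            ? token.Substring(1, token.Length - 2).Replace("\"\"", "\"")
            : token;
    }
}
```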
Kux
1

There's just a tiny problem with Squazz's answer: it works for his string, but not if you add more items. E.g.

string myString = "WordOne \"Word Two\" Three";

In that case, the removal of the last quotation mark would get us 4 results, not three.

That's easily fixed though: just count the number of escape characters, and if it's odd, strip the last one (adapt as per your requirements).

    public static List<String> Split(this string myString, char separator, char escapeCharacter)
    {
        int nbEscapeCharacters = myString.Count(c => c == escapeCharacter);
        if (nbEscapeCharacters % 2 != 0) // odd number of escape characters
        {
            int lastIndex = myString.LastIndexOf("" + escapeCharacter, StringComparison.Ordinal);
            myString = myString.Remove(lastIndex, 1); // remove the last escape character
        }
        var result = myString.Split(escapeCharacter)
                             .Select((element, index) => index % 2 == 0  // If even index
                                                   ? element.Split(new[] { separator }, StringSplitOptions.RemoveEmptyEntries)  // Split the item
                                                   : new string[] { element })  // Keep the entire item
                             .SelectMany(element => element).ToList();
        return result;
    }

I also turned it into an extension method and made separator and escape character configurable.

user3566056
0

OP wanted to

... remove all spaces EXCEPT for the spaces found between quotation marks

The solution from Cédric Bignon almost did this, but didn't take into account that there could be an uneven number of quotation marks. Starting out by checking for this, and then removing the excess ones, ensures that we only stop splitting if the element really is encapsulated by quotation marks.

string myString = "WordOne \"Word Two";
int placement = myString.LastIndexOf("\"", StringComparison.Ordinal);
if (placement >= 0)
    myString = myString.Remove(placement, 1);

var result = myString.Split('"')
                     .Select((element, index) => index % 2 == 0  // If even index
                                           ? element.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)  // Split the item
                                           : new string[] { element })  // Keep the entire item
                     .SelectMany(element => element).ToList();

Console.WriteLine(result[0]);
Console.WriteLine(result[1]);
Console.ReadKey();

Credit for the logic goes to Cédric Bignon, I only added a safeguard.

Squazz
0

I used these patterns:

Without including quotes (single and double) and without positive lookbehind:

pattern = "/[^''\""]+(?=[''\""][ ]|[''\""]$)|[^''\"" ]+(?=[ ]|$)/gm"

Without including quotes (single and double) and with positive lookbehind:

pattern = "/(?<=[ ][''\""]|^[''\""])[^''\""]+(?=[''\""][ ]|[''\""]$)|(?<=[ ]|^)[^''\"" ]+(?=[ ]|$)/gm"

Including quotes (single and double) and without positive lookbehind:

pattern = "/[''].+?['']|[\""].+?[\""]|[^ ]+/gm"

Tested here:

  1. regex101
  2. regexr
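
The patterns above are written with the `/.../gm` delimiters and flags used by online testers. A sketch of how the second one might be used from .NET, assuming the slashes and flags are dropped and RegexOptions.Multiline stands in for `m` (the names `LookaroundSplit` and `Tokenize` are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundSplit
{
    // The second pattern above, without the "/" delimiters and "gm" flags.
    private const string Pattern =
        @"(?<=[ ][''\""]|^[''\""])[^''\""]+(?=[''\""][ ]|[''\""]$)|(?<=[ ]|^)[^''\"" ]+(?=[ ]|$)";

    // Hypothetical helper: returns tokens without their enclosing quotes.
    public static List<string> Tokenize(string input)
    {
        // Regex.Matches already returns every match (the "g" flag);
        // RegexOptions.Multiline plays the role of "m".
        return Regex.Matches(input, Pattern, RegexOptions.Multiline)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

Because the quotes are only looked at through lookbehind and lookahead, they never become part of the match, so no trimming step is needed afterwards.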
Riccardo Volpe