0

I've seen a lot of regex answers that get very close to what I need, but none is quite there. I have a string that I need to split on a character (e.g. a space or '='), but I want to ignore anything inside quotes (even quotes inside quotes).

The closest I've been able to get is this:

" (?=(?:[^"]*"[^"]*")*[^"]*$)"

This works great, with two caveats: poorly timed spaces inside the quotes trigger a bad split, and it reads backwards. I don't really care about the first problem; there's not much I can do about it, and I can work around it. But the second is critical.

The case is that sometimes the string I'm regexing may be accidentally missing a quote on the end. This doesn't really bother my system, but the regex above goes backward, so it breaks everything:

string test = "foo bar \"foo bar\" foobar \"foo";
var result = Regex.Split(test, @" (?=(?:[^""]*""[^""]*"")*[^""]*$)");

This will make:

foo bar "foo
bar" foobar "foo

Because it starts at the end and runs the filter backwards. I need the result to be:

foo
bar
"foo bar"
foobar
"foo

I know the $ is responsible for the start at the end thing, but I can't for the life of me figure out how to reverse it. Thoughts?

Dan B
  • possible duplicate of [Split a string that has white spaces, unless they are enclosed within "quotes"?](http://stackoverflow.com/questions/14655023/split-a-string-that-has-white-spaces-unless-they-are-enclosed-within-quotes) – dav_i Oct 29 '13 at 17:40

4 Answers

1

You can use this regex when splitting.

("[^"]+"|\s+)

Most splitting functions return the delimiters as well if you enclose the pattern in parentheses. Here, the pattern first tries to match a quoted word at the current position; if it can't, it falls back to matching spaces.

Once you have all the values, just get rid of those that contain only the delimiter you want to discard (spaces, in this case).
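
For the C# case in the question, the same idea can be sketched with Regex.Split, which also includes captured delimiter text in the returned array. This is only a sketch under that assumption; the class and method names (CaptureSplitDemo, Tokenize) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CaptureSplitDemo
{
    // Hypothetical helper: split on quoted runs or whitespace. The capture
    // group makes Regex.Split return the delimiters too, so the quoted runs
    // survive; empty and whitespace-only pieces are then filtered out.
    public static string[] Tokenize(string input)
    {
        return Regex.Split(input, @"(""[^""]+""|\s+)")
                    .Where(piece => !string.IsNullOrWhiteSpace(piece))
                    .ToArray();
    }

    public static void Main()
    {
        string test = @"foo bar ""foo bar"" foobar ""foo";
        foreach (string token in Tokenize(test))
        {
            Console.WriteLine(token);
        }
    }
}
```

The unterminated quote at the end never matches the quoted alternative, so it stays in the final field as plain text.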

Here is a sample using Perl.

use warnings;
use strict;

my $string = "foo bar \"foo bar\" foobar \"foo";

my @array =  grep { ! /^\s*$/ } # Discard matches containing only spaces.
                 split /("[^"]+"|\s+)/, $string; # Split on whitespace or characters within quotes.
                                         # The capture group returns the delimiters as fields.

print "$_\n" foreach @array;

OUTPUT

foo
bar
"foo bar"
foobar
"foo
edi_allen
  • Yours seems to have provided a workable solution. It did separate them in a way I could use, but the output wasn't quite as elegant and clean as Alan Moore's, so I'm going to go with his. But thank you for your good example as well. – Dan B Oct 30 '13 at 17:25
1

It doesn't actually run backward, it's just that the lookahead has to match all the way to the end each time it's applied. That's the only way it can be sure there's an even number of quotes following the current position.
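
To make the parity check concrete, here is a small sketch that does by hand what the lookahead does at each space: count the quotes that follow and allow a split only when that count is even. The names LookaheadParityDemo and QuotesAfter are illustrative, not part of any library.

```csharp
using System;
using System.Linq;

public static class LookaheadParityDemo
{
    // Hypothetical helper: number of double quotes strictly after the given index.
    public static int QuotesAfter(string input, int index)
    {
        return input.Skip(index + 1).Count(c => c == '"');
    }

    public static void Main()
    {
        string test = @"foo bar ""foo bar"" foobar ""foo";
        for (int i = 0; i < test.Length; i++)
        {
            if (test[i] != ' ') continue;

            int remaining = QuotesAfter(test, i);
            // The lookahead succeeds only when an even number of quotes remains.
            string verdict = remaining % 2 == 0 ? "split" : "no split";
            Console.WriteLine($"space at {i}: {remaining} quote(s) follow -> {verdict}");
        }
    }
}
```

With the unbalanced test string from the question, the only space followed by an even number of quotes is the one inside "foo bar", which is why the split lands there and nowhere else.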

But that's a hackish solution anyway; something you should do only if you're being forced to use Split(). It's usually much easier to match the tokens themselves. For example:

string s = @"foo bar ""foo bar"" foobar ""foo";
Regex r = new Regex(@"[^""\s]+|""[^""]+(?:""|$)");

foreach (Match m in r.Matches(s))
{
  Console.WriteLine(m.Value);
}

output:

foo
bar
"foo bar"
foobar
"foo

edit: This version allows unquoted tokens to contain quotes:

@"[^""\s]\S*|""[^""]+(?:""|$)"

I'm still assuming unquoted tokens can't contain any whitespace.


edit: It seems quotes are special all the time, not just when they're the first non-whitespace character in a token. In this version, a token may start or end with non-quotes and may contain one or more quoted sequences. Because everything is optional, it starts with a lookahead that prevents it from matching an empty string.

@"(?=\S)[^\s""]*(?:""[^""]+(?:$|""[^\s""]*))*"

As before, the final closing quote is optional.
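
A quick way to try that last pattern is to wrap it in a helper and feed it the inputs from this thread. This is a sketch only; TokenMatchDemo and Tokenize are hypothetical names, and empty quoted strings ("") are not considered.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenMatchDemo
{
    // The final pattern above: optional non-quote text, then any number of
    // quoted sequences, each of which may lack its closing quote only at
    // the very end of the input.
    private static readonly Regex Token =
        new Regex(@"(?=\S)[^\s""]*(?:""[^""]+(?:$|""[^\s""]*))*");

    // Hypothetical helper: return every token matched in the input.
    public static string[] Tokenize(string input)
    {
        return Token.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        string test = @"foo bar foobar ""foobar"" ""foo bar"" foo""bar"" foo""b ar"" ""foo";
        foreach (string token in Tokenize(test))
        {
            Console.WriteLine(token);
        }
    }
}
```

Here foo"bar" and foo"b ar" each come back as a single token, and the trailing "foo is still matched even though its closing quote is missing.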

Alan Moore
  • This was the best answer for my purposes. It isn't really for splitting, but that's perfectly fine, and might end up being better anyway. This provided the cleanest correct output. Thank you, Alan! – Dan B Oct 30 '13 at 17:27
  • There actually is 1 problem with this that I'd like to try to work around. In this example, it matches quoted things as their own match. So: `foo bar "foo bar" foo"bar" "foo` Becomes: `foo bar "foo bar" foo "bar" "foo` My goal would be that foo"bar" is one match, because there's no space. But this regex splits foo"bar" up into two. – Dan B Oct 30 '13 at 17:46
  • Please don't hate me. That's really close again, but still hitting a wall. `foo bar foobar "foobar" "foo bar" foo"bar" foo"b ar"` Results in the last part being `foo"b` and then `ar"`. Also you are awesome for sticking with me, Alan. – Dan B Oct 31 '13 at 16:26
  • There you go. If that's not sufficient, I don't think I'll be able to help. We're already well beyond what can *reasonably* be done with a regex. ;} – Alan Moore Nov 01 '13 at 03:20
  • Miracle work, sir. You've just done miracle work. Thank you. – Dan B Nov 01 '13 at 13:54
0

What if you tried this approach instead?

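using System.Linq;                     // for Count()
using System.Text.RegularExpressions;  // for Regex

string test = "foo bar \"foo bar\" foobar \"foo";
// Append a closing quote if the string has an odd number of quotes.
if (test.Count(q => q == '"') % 2 == 1)
    test += "\"";

// Remove every quoted section.
test = Regex.Replace(test, "\"[^\"]+\"", "");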

This tests whether the string has an odd number of quotes and appends one if it does. It then removes anything inside quotes using "\"[^\"]+\"". After that, you are free to split it simply using String.Split().
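
As a variation on this idea (not the Replace step above), the quote-balancing check could be combined with the lookahead split from the question, so that quoted text is kept rather than removed. A minimal sketch, with BalanceThenSplitDemo and Tokenize as made-up names:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BalanceThenSplitDemo
{
    // Hypothetical helper: balance the quotes first, then split on spaces
    // that are followed by an even number of quotes.
    public static string[] Tokenize(string input)
    {
        // Append a closing quote when the quote count is odd.
        if (input.Count(c => c == '"') % 2 == 1)
        {
            input += "\"";
        }

        // Same lookahead pattern as in the question.
        return Regex.Split(input, @" (?=(?:[^""]*""[^""]*"")*[^""]*$)");
    }

    public static void Main()
    {
        string test = @"foo bar ""foo bar"" foobar ""foo";
        foreach (string token in Tokenize(test))
        {
            Console.WriteLine(token);
        }
    }
}
```

Note that the last token comes back as "foo" with the appended closing quote, not as the original unterminated "foo.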

Jonesopolis
  • This could work as a hack solution, which I was already doing, but I was looking for an elegant Regex solution. – Dan B Oct 30 '13 at 17:26
0

I think Regex 1 or Regex 2 should do the trick.

 # =====================================
 # Regex 1
 # =====================================
 #    ("[^"]")|[\s=]+             // raw
 #    "(\"[^\"]\")|[\\s=]+"       // escaped
 #    @"                          // verbatim
 #     (""[^""]"")|[\s=]+
 #    "
 # -------------------------------------
 #    
 #         ( " [^"] " )      # expanded Regex 1
 #      |  
 #         [\s=]+ 

 # =====================================
 # Regex 2
 # =====================================
 #    ("(?:[^"]*"[^"]*")*[^"]*")|[\s=]+             // raw
 #    "(\"(?:[^\"]*\"[^\"]*\")*[^\"]*\")|[\\s=]+"   // escaped
 #    @"                                            // verbatim
 #     (""(?:[^""]*""[^""]*"")*[^""]*"")|[\s=]+
 #    "
 # -------------------------------------
 #        
 #        (                  # expanded Regex 2
 #             " 
 #             (?: [^"]* " [^"]* " )*
 #             [^"]* 
 #             "     
 #        )
 #     |  
 #        [\s=]+ 
  • Regex 1 didn't ignore spaces in quotes. Regex 2 did, but it didn't understand how to know when quotes ended for a section. But I appreciate your input. – Dan B Oct 30 '13 at 17:22