So, what I need to do with a C# regex is split a string wherever a certain pattern occurs, but ignore that pattern when it is surrounded by double quotes in the string.

Example:

string text = "abc , def , a\" , \"d , oioi";
string pattern = "[ \t]*,[ \t]*";

string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (3 splits, 4 strings):

    {"abc",
     "def",
     "a\" , \"d",
     "oioi"}

Actual result (4 splits, 5 strings):

    {"abc",
     "def",
     "a\"",
     "\"d",
     "oioi"}

Another example:

string text = "a%2% 6y % \"ad%t6%&\" %(7y) %";
string pattern = "%";

string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (5 splits, 6 strings):

    {"a",
     "2",
     " 6y ",
     " \"ad%t6%&\" ",
     "(7y) ",
     ""}

Actual result (7 splits, 8 strings):

    {"a",
     "2",
     " 6y ",
     "\"ad",
     "t6",
     "&\" ",
     "(7y) ",
     ""}

A 3rd example, to exemplify a tricky split where only the first case should be ignored:

string text = "!!\"!!\"!!\"";
string pattern = "!!";

string[] result = Regex.Split(text, pattern, RegexOptions.ECMAScript);

Wanted result after split (2 splits, 3 strings):

    {"",
     "\"!!\"",
     "\""}

Actual result (3 splits, 4 strings):

    {"",
     "\"",
     "\"",
     "\""}

So, how do I move from pattern to a new pattern that achieves the desired result?

Sidenote: If you're going to mark someone's question as duplicate (and I have nothing against that), at least point them to the right answer, not to some random post (yes, I'm looking at you, Mr. Avinash Raj)...

Uwe Keim
  • did you delete your previous question? – Avinash Raj Sep 26 '15 at 11:50
  • You have re-posted the same question as was closed as a duplicate of [this one](http://stackoverflow.com/q/3147836/335858). If you think your question is **not** a duplicate of this question, please explain it in the body of the question. – Sergey Kalinichenko Sep 26 '15 at 11:50
  • @AvinashRaj [Yes, he did](http://stackoverflow.com/q/32796029/335858). – Sergey Kalinichenko Sep 26 '15 at 11:51
  • https://regex101.com/r/rB6hG3/1 – Avinash Raj Sep 26 '15 at 11:53
  • Yes, I did delete it, because no amount of edits will "UnDuplicate" it. As you can see by the third example, the answer provided in [here](http://stackoverflow.com/questions/3147836/c-sharp-regex-split-commas-outside-quotes), which is the same as [here](https://regex101.com/r/rB6hG3/1), will not work. – Miguel Noronha Sep 26 '15 at 12:17
  • Could you explain *how do I move from pattern to a new pattern*? Do you mean to ask to fix each of the 3 regexes? 3-in-1 question? Or do you want to combine them into 1? – Wiktor Stribiżew Sep 26 '15 at 12:41
  • If you have a string `abc , def , a", "d , oioi` the result between `a", "d` would not be `a` and `d` because the comma is within the quote which you said **ignore**. The result would be `ad`. I recommend you redefine the requirements; they seem to conflict with the result examples you gave. – ΩmegaMan Sep 26 '15 at 13:14
  • [First](http://regexstorm.net/tester?p=%2c(%3f%3d(%3f%3a%5b%5e%22%5d*%22%5b%5e%22%5d*%22)*%5b%5e%22%5d*%24)&i=abc+%2c+def+%2c+a%22+%2c+%22d+%2c+oioi&o=e) and [second](http://regexstorm.net/tester?p=%25(%3f%3d(%3f%3a%5b%5e%22%5d*%22%5b%5e%22%5d*%22)*%5b%5e%22%5d*%24)&i=a%252%25+6y+%25+%22ad%25t6%25%26%22+%25(7y)+%25&o=e) can be considered duplicates of the question mentioned in the deleted post. The 3rd one contains 3 double quotes, and it is difficult to tell what is really *inside* them. – Wiktor Stribiżew Sep 26 '15 at 13:34

2 Answers

The rules are more or less like in a csv line except that:

  • the delimiter can be a single character, but it can also be a string or a pattern (in those cases, items must be trimmed if they start with the last possible tokens of the delimiter pattern or end with its first possible tokens),
  • an orphan quote is allowed for the last item.

First, when you want to separate items (to split) with slightly more advanced rules, the split method is no longer a good choice. The split method is only handy for simple situations, not for your case. (Even without orphan quotes, using split with ,(?=(?:[^"]*"[^"]*")*[^"]*$) is a bad idea, since the lookahead rescans the rest of the string at every candidate delimiter, so the number of steps grows roughly quadratically with the string size.)
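For reference, the lookahead-based split mentioned above can be sketched as follows. This is only an illustration with balanced quotes, using the \s*,\s* delimiter; the names LookaheadSplitDemo and SplitOutsideQuotes are made up for this example and do not come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadSplitDemo
{
    // The delimiter \s*,\s* is accepted only when an even number of quotes
    // remains between it and the end of the string (i.e. it is outside quotes).
    const string Pattern = @"\s*,\s*(?=(?:[^""]*""[^""]*"")*[^""]*$)";

    // Hypothetical helper: split on delimiters that are not inside quotes.
    public static string[] SplitOutsideQuotes(string text)
    {
        return Regex.Split(text, Pattern);
    }

    static void Main()
    {
        // First example from the question: expected items are
        // abc / def / a" , "d / oioi
        foreach (string item in SplitOutsideQuotes("abc , def , a\" , \"d , oioi"))
            Console.WriteLine("[" + item + "]");
    }
}
```

Note that the lookahead has to walk to the end of the string for each comma it tests, and that it relies on the quotes being balanced, so an orphan quote changes which delimiters count as "outside".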

The other approach consists of capturing the items. It is simpler and faster. (Bonus: it checks the format of the whole string at the same time.)

Here is a general way to do it:

^
(?>
  (?:delimiter | start_of_the_string)
  (
      simple_part
      (?>
          (?: quotes | delim_first_letter_1 | delim_first_letter_2 | etc. )
          simple_part
      )*
  )
)+
$

Example with \s*,\s* as delimiter:

^
# atomic group for one delimiter and one item
(?>
    (?: \s*,\s* | ^ ) # delimiter or start of the string
                      # (optionally change "^" to "^ \s*" to trim the first item)

    # capture group 1 for the item 
    (   # simple part of the item (maybe empty):
        [^\s,"]* # anything that is neither the quote character nor one of the possible
                 # first characters of the delimiter
        # edge case followed by a simple part
        (?>
            (?: # edge cases
                " [^"]* (?:"|$) # a quoted part or an orphan quote in the last item (*)
              |   # OR
                (?> \s+ ) # start of the delimiter
                (?!,)     # but not the delimiter
            )

            [^\s,"]* # simple part
        )*
    )
)+
$

demo (click on the table link)

The pattern is designed for the Regex.Match method since it describes the whole string. All items are available in group 1, because the .NET regex flavor keeps every capture of a repeated group (see Group.Captures).

This example can be easily adapted to all cases.

(*) If you want to allow escaped quotes inside quoted parts, you can apply the same simple_part (?: edge_case simple_part)* structure once more in place of " [^"]* (?:"|$),
i.e.: "[^\\"]* (?: \\. [^\\"]*)* (?:"|$)
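As a minimal sketch of how the pattern might be used from C#, the snippet below puts the \s*,\s* version on one line (without the comments and free-spacing) and reads the items from the captures of group 1. The class name QuoteAwareSplitter and the method SplitItems are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class QuoteAwareSplitter
{
    // Compact form of the commented pattern (delimiter: \s*,\s*).
    // Quotes are doubled because this is a C# verbatim string.
    const string Pattern =
        @"^(?>(?:\s*,\s*|^)([^\s,""]*(?>(?:""[^""]*(?:""|$)|(?>\s+)(?!,))[^\s,""]*)*))+$";

    // Returns the items, or null when the string as a whole does not fit the format.
    public static string[] SplitItems(string text)
    {
        Match m = Regex.Match(text, Pattern);
        if (!m.Success) return null;

        // Group 1 is repeated, so its Captures collection holds one entry per item.
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    static void Main()
    {
        // First example from the question: expected items are
        // abc / def / a" , "d / oioi
        foreach (string item in SplitItems("abc , def , a\" , \"d , oioi"))
            Console.WriteLine("[" + item + "]");
    }
}
```

For another delimiter, both the delimiter alternative and the "start of the delimiter" edge case need to be rewritten together, as described in the general form above.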

Casimir et Hippolyte

I think this is a two-step process, and trying to make it a one-step regex is overthinking it.


Steps

  1. Simply remove any quotes from a string.
  2. Split on the target character(s).

Example of Process

I will split on the , for step 2.

var data = string.Format("abc , def , a{0}, {0}d , oioi", "\"");

// `\x22` is the hex escape for a quote ("), which is easier to read in C# source.
var stage1 = Regex.Replace(data, @"\x22", string.Empty);

// abc , def , a", "d , oioi
// becomes
// abc , def , a, d , oioi

// Requires: using System.Linq; using System.Text.RegularExpressions;
var result = Regex.Matches(stage1, @"([^\s,]+)[\s,]*")
                  .OfType<Match>()
                  .Select(mt => mt.Groups[1].Value)
                  .ToList();

Result

For the sample string, the five values are abc, def, a, d and oioi.

ΩmegaMan