1

Got an interesting problem here for everyone to consider:

I am trying to parse and tokenize strings delimited by a "/" character but only when not in between parenthesis.

For instance:

Root/Branch1/branch2/leaf

Should be tokenized as: "Root", "Branch1", "Branch2", "leaf"

Root/Branch1(subbranch1/subbranch2)/leaf

Should be tokenized as: "Root", "Branch1(subbranch1,subbranch2)", "leaf"

Root(branch1/branch2) text (branch3/branch4) text/Root(branch1/branch2)/Leaf

Should be tokenized as: "Root(branch1/branch2) text(branch3/branch4)", "Root(branch1/branch2)", "leaf".

I came up with the following expression which works great for all cases except ONE!

([^/()]*\((?<=\().*(?=\))\)[^/()]*)|([^/()]+)

The only case where this does not work is the following test condition:

Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf

This should be tokenized as: "Root(branch1/branch2)", "SubRoot", "SubRoot(branch3/branch4)", "Leaf"

The result I get instead consists of only one group that matches the whole line so it is not tokenizing it at all:

"Root(branch1/branch2)/SubRoot/SubRoot(branch3/branch4)/Leaf"

What is happening here is that because Regex is greedy it is matching the left most opening parenthesis "(" with the last closing parenthesis ")" instead of just knowing to stop at its appropriate delimiter.

Any of you Regex gurus out there can help me figure out how to add a small Regex piece to my existing expression to handle this additional case?

Root(branch1/branch2) Test (branch3/branch4)/SubRoot/SubRoot(branch5/branch6)/Leaf

Should be tokenized into groups as:

"Root(branch1/branch2) Test (branch3/branch4)"
"SubRoot"
"SubRoot(branch5/branch6)"
"Leaf"
Cœur
  • 37,241
  • 25
  • 195
  • 267
Sam A.
  • 87
  • 2
  • 8

3 Answers3

1
List<string> Tokenize(strInput)
{
  var sb = new StringBuilder();
  var tokens = new List<string>();
  bool inParen = false;
  foreach(var c in strInput)
  {
      if (inParens)
      {
           if (c == ')')
               inParens = false;
           else
               sb.Append(c);
       }
       else if (c == '(')
               inParens = true;
       else if (c == '/')
            {
                 tokens.Add(sb.ToString());
                 sb.Length = 0;
            }
       else
             sb.Append(c);

  }
  if (sb.Length > 0)
      tokens.Add(sb.ToString());

  return tokens;
}

That's untested but it should work. (and will almost certainly be much faster than the regex)

James Curran
  • 101,701
  • 37
  • 181
  • 258
  • Not if you write the regex properly ;=) (IMHO) – robinCTS Jan 23 '13 at 21:04
  • Thanks for your response James. Unfortunately, spec requests this be done in .Net Regex rather than code. – Sam A. Jan 23 '13 at 21:08
  • Hmmm... The "spec" requires? The only time I've run across an assignment which dictated implementation was homework. – James Curran Jan 25 '13 at 12:33
  • @robinCTS:IMHO - only if you write the code really badly. Look at the code. What could a Regex do that could possibly speed it up? If you wrote the regex optimally, AND compiled it, then you MIGHT be able to match the hand-coded version, but even that would be difficult, as regex just has more inherent overhead. (e.g, the Matches collection is more complex than the simple string list I return) Regexes have their place--I've done a lot in Regex that I'd never try hand-coding--but in this case, where the pattern is rather simple, except for one twist which is difficult in Regex, code is best. – James Curran Jan 25 '13 at 12:42
  • I have to agree with your general observations there. Even that your solution is faster. Not sure about the "much faster". I'm a bit of a regex optimization freak :) – robinCTS Jan 25 '13 at 12:59
  • @JamesCurran LOL. not in that case. this expression will be used to parse our data for multiple platforms so it was required to be a Regex. – Sam A. Jan 26 '13 at 23:35
1

Different approach, trying to avoid costly look-around assertions...

/(\(.+?\)|[^\/(]+)+/

With some comments...

/
(           # group things to be captured
  \(.+?\)   # 1 or more of anything in (escaped) brackets, un-greedily
|           # or ...
  [^\/(]+   # 1 or more, not slash, and not open bracket characters
)+          # repeat until done...
/
Billy Moon
  • 57,113
  • 24
  • 136
  • 237
  • I tend to avoid the `.` if at all possible and use the possessive quantifier as much as possible. That would make the pattern `/(\([^)]++\)|[^\/(]++)+/`. In general this speeds up evaluation and helps safeguard against catastrophic backtracking. Unfortunately some regex engines don't do possessive quantifiers. – robinCTS Jan 23 '13 at 21:00
  • You could also use `[^)]` instead of the `.`, although I doubt it would make a difference in backtracking. – Billy Moon Jan 23 '13 at 21:03
  • I do it as a safety precaution. I this case you are right in that it won't make any difference. However my way allows me to use the possessive quantifier and _guarantee_ no backtracking will occur. – robinCTS Jan 23 '13 at 21:13
  • Hi Billy. Thanks for your suggestion. I went ahead with @Rawling's suggestion below since it most met my task at hand. Thanks again for your help. – Sam A. Jan 24 '13 at 15:19
  • @SamA. Happy to help - it has been an education to me. I have become aware of both possessive quantifiers (which can be emulated with atomic grouping) and balancing group (which I doubt I will even use, as it's language specific, and I try to use widely supported regex only). Thanks – Billy Moon Jan 25 '13 at 10:32
  • @BillyMoon - Actually, possessive quantifiers are syntactic sugar for the equivalent atomic group, ie, `.++` is exactly the same as `(?>.+)`. Interestingly enough, some regex engines support atomic groups, but _not_ possessive quantifiers. – robinCTS Jan 26 '13 at 12:57
0

The following uses balanced groups to capture each matching item with Regex.Matches, ensuring the closing / isn't matched when the brackets before it haven't balanced:

(?<=^|/)((?<br>\()|(?<-br>\))|[^()])*?(?(br)(?!))(?=$|/)

Bizarrely, this seems to perform similarly to Billy Moon's much simpler answer, even though this is overengineered (supporting multiple, possibly nested sets of brackets per token).


The following does something similar, but splits the string with Regex.Split (linebreaks added for clarity):

(?<=^(?(brb)(?!))(?:(?<-brb>\()|(?<brb>\))|[^()])*)
/
(?=(?:(?<bra>\()|(?<-bra>\))|[^()])*(?(bra)(?!))$)

This matches "any / where any brackets between the start of the string and the / are balanced, and any bracket between the / and the end of the string are balanced".

Note that in the lookbehind, the brb captures appear in reverse order from before - this is because a lookbehind apparently works right-to-left. (Thanks to Kobi for the answer that taught me this.)

This is much slower than the match version, but I wanted to work out how to do it anyway.

Community
  • 1
  • 1
Rawling
  • 49,248
  • 7
  • 89
  • 127
  • `(?<-BR>\)` looks weird. Is this valid syntax? – nhahtdh Jan 23 '13 at 20:30
  • That's supposed to be the "balance" to the first `
    ` group, e.g. it can only match if there are matched `
    ` groups, and it removes one from the stack. That said, I don't have immediate access to a compiler and none of the online regex tools I've found seem to support this feature.
    – Rawling Jan 23 '13 at 20:38
  • Ah, just found it. Balancing group... This seems to be .NET only feature. – nhahtdh Jan 23 '13 at 20:42
  • Hello Rawling, Thanks for your input. This works great for most cases, but does not handle cases like this: "Root(test1/test2) Text (test3/test4)/Branch/Leaf". This needs to be tokenized as "Root(test1/Test2) Text (test3/test4)" , "Branch" , "Leaf". However, the Regex you suggest tokenizes as :"Root(test1/test2) Text(test3", "Branch", "Leaf" with the first token group missing part of the expression string. Any suggestions to fix this case? Thanks! – Sam A. Jan 23 '13 at 21:15
  • @SamA. I've updated the expression - it's still stabbing in the dark without a compiler but I think I see where I went wrong. EDIT: Download Expresso, this semms to work :) – Rawling Jan 23 '13 at 21:27
  • 1
    Thanks @Rawling. I went ahead with you answer and it works for me. and Expresso Rocks! have been using it for a while now to dive a bit in Regex. – Sam A. Jan 24 '13 at 15:39
  • @Rawling do you have any insight into why your regex performs similarly to mine? I wrote mine, expecting it to be much faster, because it is not using costly assertions, which yours seems to have a few of. Are regex engines better now at handling assertions? Could I somehow improve the performance of my regex by another method? (I am going to look into possessive quantifiers). Cheers – Billy Moon Jan 25 '13 at 15:44
  • @Billy Honestly, I'm not really sure. Yours is really straightforward, it should be much faster. I wonder if there's a regex profiler tool that could explain it. – Rawling Jan 25 '13 at 20:32