Groups are not capturing what I expect in C# regex. Here is a very simple example:

var matches = Regex.Matches("abcdededefgh","(abc)(de)*(fgh)");

I thought this would capture abc, dedede and fgh as 3 separate groups, since each has a separate set of parentheses. It does, but it also captures the whole string as a group (as the first captured group of four). Given that I don't have parentheses around the whole pattern (i.e. my pattern is not "((abc)(de)*(fgh))"), I don't understand why the extra group is being captured. This makes it hard for me to predict the behavior and determine which group corresponds to which portion of the string.

Also, note that the following has the same 4 group result, so the fact that the "0 to many" asterisk is outside the group parentheses in the above example does not seem to impact the result.

var matches = Regex.Matches("abcdededefgh","(abc)((?:de)*)(fgh)");

Many thanks for any assistance!

user756366
  • There actually is no difference between groups and captures. Why someone would even ask is beyond compare. –  Feb 28 '17 at 00:45
  • Thank you very much Wiktor for linking to that very helpful explanation of groups and captures! I had understood "groups" to be only the items surrounded by parentheses, and hadn't realized that a "$0" group for the full match existed to be referenced. sln, I'm not sure why you feel the need to be insulting. There are many things I do understand about regexes, but this one I did not, so I asked, in a nice simple and clear example that I figured would not take a lot of time for someone who understood it to answer, and for which I could not think of search terms to find the answer. – user756366 Feb 28 '17 at 01:43
  • Thanks for the compliment. In the comments on the accepted answer, I posted an updated technical breakdown of capture collections on that old post if you're interested. I would not have posted there had you not been so curious. I hope you enjoy _Regular Expressions_ as much as we do. Feel free to ask questions, but do read up on the basics a bit. –  Feb 28 '17 at 02:56
  • Thank you - yes, I did find your additional comment below helpful! I do enjoy regexes, and actually do have a pretty good grasp of lookahead, lookbehind, character classes, greedy vs non-greedy capture, \s, \w, start of string and end of string characters, and have used all of those various times through my career. I've done some replacement, but not complex replacements in the past. My question was actually not indicative that I've never read through the basics, but rather that there was one point that got missed, because my level at this point is still advanced-beginner/intermediate in this area. – user756366 Mar 02 '17 at 21:45
  • My comments below your post here have all referred to the _duplicate_ [What's the difference between “groups” and “captures” in .NET regular expressions?](http://stackoverflow.com/questions/3320823/whats-the-difference-between-groups-and-captures-in-net-regular-expression) See that post for an extended discussion. –  Mar 02 '17 at 21:53
  • Aah, sorry - I'd looked through that for a full answer by you, but had missed that the comments were yours. Do I understand correctly - I think you're saying that .NET's ability to show the detail of all passes in a quantified capture from a single Regex call provides efficiency, because the alternative in a language that doesn't provide that ability is to run a find-next command for each one. I'm not sure though if you're also saying that the ability to see all captures is needed often? Or that even if the caller doesn't need to see each capture, this .NET characteristic still aids efficiency? – user756366 Mar 02 '17 at 22:15
  • Yeah, like I was saying there. Capturing is there by default, I don't think you can turn it off. Given that, the lack of _captures_ has long been the shortcoming in other engines, simply because each pass overwrites the group and forces one to find the largest _fixed_ number of groups that is needed, then _pause_, exit the object code of the engine while you're saving some string, then re-enter the engine, with some state fix-up. Overhead! Then there is the dynamic _variable_ number of capture groups it facilitates, both horizontally (fields) and vertically (records). –  Mar 03 '17 at 01:20
  • Performance can be compared to a _replaceAll()_ type function vs. a find/find-next where you do your own text maintenance. –  Mar 03 '17 at 01:22

1 Answer

Each quantified pass overwrites the capture group, as in (a)*, so (de)* leaves only the last "de" in group 2.
To capture the whole run in one group, change the pattern to (abc)((?:de)*)(fgh)

The extra group you see is group 0, which is the overall match
of the regex. So groups 0, 1, 2, 3 = 4 groups.
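
To see the numbering for yourself, here is a minimal sketch that lists every group of the first match, index 0 first. The class and method names (GroupDemo, GroupValues) are made up for illustration; only Regex.Match, Match.Groups and Group.Value come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class GroupDemo
{
    // Returns the value of every group in the first match.
    // Index 0 is the overall match; 1..n are the parenthesized groups.
    public static string[] GroupValues(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        string[] values = new string[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            values[i] = m.Groups[i].Value;
        }
        return values;
    }
}
```

Calling GroupDemo.GroupValues("abcdededefgh", "(abc)(de)*(fgh)") should give abcdededefgh, abc, de, fgh, while the pattern "(abc)((?:de)*)(fgh)" should give abcdededefgh, abc, dedede, fgh. Both have four entries; only group 2 differs.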

  • Can you explain the concept of a qualified pass? – Daniel Feb 28 '17 at 00:29
  • Yeah, it's called a _quantified_ pass. Each time a quantified group can match, it adds the previous capture to the overall capture ( group 0 ), then clears the capture group that is quantified, then fills it up with a new capture... So `(.)*` on "abc" will fill group 1 with first 'a', clear, then 'b', clear, then 'c', etc.. You end up seeing group 0 = abc, group 1 = c. –  Feb 28 '17 at 00:31
  • You can, however, use capture collections to actually store those quantified captures in an array. You can then access them. This could be important should you wish to do something like `(?.*?(dat.a))+` which, on a single match, can store a great deal of information. And, by definition, this regex is considered a quantified capture. This is the _only_ thing I like about .NET regexes –  Feb 28 '17 at 00:36
  • Thank you very much for the posted answer sln, and for the explanation regarding the quantified pass. The ability to capture all of the matches will be very helpful to me! Marking as answer, as this does indeed answer my question, although I also would request you to refrain from making insulting comments such as the one you posted under the main question in the future, as it is hurtful and I think does not serve a useful purpose. – user756366 Feb 28 '17 at 01:48
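
For anyone landing here later, the capture collections mentioned in the comments above can be sketched roughly like this. The names CaptureDemo and CapturesOf are hypothetical; Group.Captures and Capture.Value are the .NET members doing the work.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class CaptureDemo
{
    // Returns every capture recorded for one group in the first match,
    // in the order the quantified passes produced them.
    public static string[] CapturesOf(string input, string pattern, int groupIndex)
    {
        Match m = Regex.Match(input, pattern);
        return m.Groups[groupIndex].Captures
                .Cast<Capture>()          // works on both .NET Framework and .NET Core
                .Select(c => c.Value)
                .ToArray();
    }
}
```

With the original pattern, CaptureDemo.CapturesOf("abcdededefgh", "(abc)(de)*(fgh)", 2) should return three entries, each "de", even though Groups[2].Value only shows the last one. Likewise `(.)*` on "abc" should give a, b, c for group 1.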