Alright, I've read the tutorials, but by now my head is too scrambled to see this clearly.

I'm trying to capture parameters and their type info from a function signature. So given a signature like this:

function(/*string*/a,b,c)

I want to get the parts like this:

type: string
param:a
param:b
param:c

This is OK too:

type: string
param:a
type: null (or whitespace)
param:b
type: null (or whitespace)
param:c

So I came up with this regex, which makes the common mistake of repeating a capturing group (I have explicit capture turned on):

function\(((\/\*(?<type>[a-zA-Z]+)\*\/)?(?<param>[0-9a-zA-Z_$]+),?)*\)

The problem is, I can't figure out how to correct the mistake. :( Please help!

Mrchief
  • What language are you using? If this is a .Net pattern, you're in luck. Otherwise, it probably isn't possible in a single step. – Kobi May 12 '11 at 18:50
  • I was hoping to solve it without using .Net too, but yeah, I'm using .Net eventually. Also, I've looked at Captures collection but I don't have a reliable way of correlating the captures to the group (or am I overlooking something?). – Mrchief May 12 '11 at 19:03
  • See the posted answer. There's `Match.Captures` which is easier to find but isn't very useful, you usually want `Group.Captures` (I'm guessing here, of course). – Kobi May 12 '11 at 19:07
  • @Kobi - I'm going to leave this open for a while in case someone can solve it through pure regex. Otherwise, yours is the closest solution I have. I tried it and it works too; but then, we knew it would, didn't we? – Mrchief May 12 '11 at 20:07

3 Answers

Generally, you'd need two steps to get all the data.
First, match/validate the whole function:

function\((?<parameters>((\/\*[a-zA-Z]+\*\/)?[0-9a-zA-Z_$]+,?)*)\)

Note that you now have a parameters group containing all of the parameters. You can run part of the pattern again over that group to get each parameter, or, in this case, simply split it on the comma.
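For flavors that don't keep a history of captures, the two steps can be sketched with Python's built-in re module. This is only an illustration under a few assumptions: the function name parse_signature is made up for the example, named groups use Python's (?P<name>...) spelling, and the per-parameter pattern is one guess at what re-applying part of the pattern could look like.

```python
import re

# Step 1: validate the whole signature and grab everything between the parentheses.
SIGNATURE = re.compile(
    r"function\((?P<parameters>(?:(?:/\*[a-zA-Z]+\*/)?[0-9a-zA-Z_$]+,?)*)\)"
)
# Step 2: applied to each comma-separated piece; the type comment is optional.
PARAMETER = re.compile(r"(?:/\*(?P<type>[a-zA-Z]+)\*/)?(?P<param>[0-9a-zA-Z_$]+)")


def parse_signature(signature):
    """Return a list of (type, param) pairs; type is None when there is no comment."""
    match = SIGNATURE.fullmatch(signature)
    if match is None:
        raise ValueError("not a recognised signature: %r" % signature)
    pairs = []
    for piece in match.group("parameters").split(","):
        if not piece:  # empty parameter list, or a trailing comma
            continue
        inner = PARAMETER.fullmatch(piece)
        pairs.append((inner.group("type"), inner.group("param")))
    return pairs


print(parse_signature("function(/*string*/a,b,c)"))
```

For the signature from the question this yields the pairs ('string', 'a'), (None, 'b') and (None, 'c'), so every parameter stays attached to its own type even when the type is missing.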

If you're using .Net, by any chance, you're in luck. .Net keeps a full record of all captures of each group, so you can use the collection:

match.Groups["param"].Captures

Some notes:

  • If you do want to capture more than one type, you definitely want empty matches, so you can easily pair each type with its parameter (you could instead sort the captures by position, but a 1-to-1 capture is neater). In that case, put the optional group inside your capturing group: (?<type>(\/\*[a-zA-Z]+\*\/)?)
  • You don't have to escape slashes in .Net patterns - / has no special meaning there (C#/.Net doesn't have regex delimiters).

Here's an example of using the captures. Again, the main point is maintaining the relation between type and param: you want to capture empty types, so you don't lose count.
Pattern:

function
\(
(?:
    (?:
        /\*(?<type>[a-zA-Z]+)\*/    # type within /* */
        |                           # or
        (?<type>)                   # capture an empty type.
    )
    (?<param>
        [0-9a-zA-Z_$]+
    )
    (?:,|(?=\s*\)))     # mandatory comma, unless before the last ')'
)*
\)

Code:

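// Requires: using System; using System.Text.RegularExpressions;
// s is the input signature; pattern is the verbose pattern above.
Match match = Regex.Match(s, pattern, RegexOptions.IgnorePatternWhitespace);
if (match.Success)
{
    // Each iteration of the repeated group records one "type" capture (possibly
    // empty) and one "param" capture, so the two collections line up by index.
    CaptureCollection types = match.Groups["type"].Captures;
    CaptureCollection parameters = match.Groups["param"].Captures;
    for (int i = 0; i < parameters.Count; i++)
    {
        string parameter = parameters[i].Value;
        string type = types[i].Value;
        if (String.IsNullOrEmpty(type))
            type = "NO TYPE";
        Console.WriteLine("Parameter: {0}, Type: {1}", parameter, type);
    }
}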
embert
Kobi
  • I checked again. It does capture multiple types. An alternate idea is to capture the entire string between parenthesis, split on comma and then loop to capture type and params one by one. – Mrchief May 12 '11 at 19:21
  • @Mrchief - Right, my bad! Missed a closing paren `:P` - I've updated the answer. The alternate idea you suggest is what I also suggested, though mistakenly only for the names - I've fixed that as well. – Kobi May 12 '11 at 19:25
  • Somehow I lost my comments here! OK, I got diverted by your first note earlier and didn't realize that the 'alternate' solution is the same as what you'd shown. I tried Group.Captures too, but that's a little dicey. It does list all the previous captures, but in a flat manner. I need to be able to relate a type to a param, so a flat running list is not very helpful. The slash thing was new to me, point noted! – Mrchief May 12 '11 at 20:05
  • @Mrchief - I think I understand the question now, maybe this update makes sense. Note that while it is fun, you don't *have* to insist on a single regex solution when an alternative might be simpler. – Kobi May 12 '11 at 21:09
  • I understand the importance of empty types now. Your example looks neat (aesthetically better than the split approach). Again, I insisted on a single regex because I'd like to use it outside of .Net. – Mrchief May 12 '11 at 21:41
  • @Mrchief - Well, I doubt you'll get it working. .Net is the only flavor that keeps all captures, as far as I'm aware, and there are other differences as well (for example, using the named group twice, but it can be avoided). – Kobi May 12 '11 at 22:04

The page you referenced mentions using ?: for non-capturing groups, then surrounding the repeated capture in its own group. I'm guessing they are suggesting something like this:

function\(((?:(\/\*(?<type>[a-zA-Z]+)\*\/)?(?<param>[0-9a-zA-Z_$]+),?)*)\)

I like to use http://gskinner.com/RegExr/ to test my expressions, but it won't show repeated captures. In non-.NET languages, you may have to loop through the results in whatever structure you get back to see the individual values.

Sorry I couldn't test more thoroughly...
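As a rough check of what that looks like outside .NET, here is a small Python sketch using the built-in re module. It assumes Python's (?P<name>...) spelling for the named groups; the names WHOLE, EACH, last_param and params are made up for the example. It shows that the repeated group only reports its final iteration, and that looping over the outer group with finditer recovers every parameter.

```python
import re

# The suggested pattern, with Python's (?P<name>...) spelling for named groups.
WHOLE = re.compile(
    r"function\(((?:(/\*(?P<type>[a-zA-Z]+)\*/)?(?P<param>[0-9a-zA-Z_$]+),?)*)\)"
)
# One parameter at a time, for looping with finditer.
EACH = re.compile(r"(?:/\*(?P<type>[a-zA-Z]+)\*/)?(?P<param>[0-9a-zA-Z_$]+)")

signature = "function(/*string*/a,b,c)"
whole = WHOLE.fullmatch(signature)

# A repeated group only reports its final iteration here.
last_param = whole.group("param")

# Group 1 holds the whole parameter list; looping over it recovers each iteration.
params = [(m.group("type"), m.group("param")) for m in EACH.finditer(whole.group(1))]

print(last_param)
print(params)
```

The single match only exposes c as the param, while the loop returns all three (type, param) pairs.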

quietchaos
  • http://regexstorm.net/tester is a useful site for testing .Net regular expressions. There's also http://regexhero.net/tester/ , but it is Silverlight-based and nags you for money from time to time. – Kobi May 12 '11 at 19:31
  • Not only does Regex Hero nag for money, it also occasionally brings the browser down (the Silverlight plugin crashes after prolonged use). Regexstorm looks promising. Thx! – Mrchief May 12 '11 at 19:52
  • @Mrchief - What browser/OS are you using? I haven't seen Regex Hero crash the browser. – Steve Wortham Jun 08 '11 at 13:34
  • @Steve - I was using Chrome (11.0.696.68 or older). It feels sluggish in Firefox 4, but then FF is more sluggish than Chrome anyway. And I don't remember if it crashed in Firefox. OS is Win XP SP3. – Mrchief Jun 08 '11 at 15:42
  • @Mrchief - Thanks man, I'll bang on some things and see what breaks. – Steve Wortham Jun 08 '11 at 16:38
  • @Mrchief - I found a layout problem which may have conceivably caused the crash. It wasn't causing Chrome to crash for me, but it was throwing an error (and trapping it) behind the scenes. I fixed that problem as well as a strange tooltip bug and some error handling around copy/paste security exceptions. I can't be sure if I fixed the problem you were having, but I'm keeping an eye on it. – Steve Wortham Jun 08 '11 at 18:03
  • @Steve: copy/paste - that smells familiar. I think it crashed during copy/paste too, although I don't remember for sure. I'll keep an eye on it too. – Mrchief Jun 08 '11 at 18:24
  • @Steve - So you created RegexHero? Awesome work man! You can count me in as one of your biggest fans! – Mrchief Jun 08 '11 at 18:25
  • @Mrchief - Yes I did. Thanks man. I'm glad you like it, except for the whole crashing bit. I'm glad I found your comment or I wouldn't have known there was a problem. – Steve Wortham Jun 08 '11 at 19:50
  • Yeah, I like it a lot except for the nag part (I understand why it's there, but I still don't like it). It's amazing how you found this post. Were you looking for some regex help? ;) – Mrchief Jun 08 '11 at 20:51
  • @Mrchief - I have Google Alerts and all sorts of other means of finding mentions of the term "Regex Hero" and I tend to find a lot of them. ;) By the way, I do believe I solved this problem. It stems from a bug in the Silverlight RichTextBox which I was able to work around... http://blog.regexhero.net/2011/06/bug-fixes-around-undoredo-and-copypaste.html – Steve Wortham Jun 28 '11 at 19:56

It's been a while since this question was active, but I think I finally found an answer.

I think I was in the same situation as you, but with PHP, and I found an answer in another post that works really well, using the \K and \G escape sequences from PCRE. See Alan Moore's answer here: PHP Regular Expression - Repeating Match of a Group

My issue was trying to pull out all the cell values in a table, where each row contained a 6-digit number, then twenty 1- or 2-digit numbers, and an unrelated 1- or 2-digit number. The solution was:

<tr class="[^"]*">\s+<td>(\d{6})<\/td>|\G<\/td>[^<>]*+<td>\K\d{1,6}|<td>(\d{1,2})<\/td>

Very nice solution if I do say so myself!
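The same "continue from where the last match ended" idea can be sketched for the function signature in the question. Python's built-in re module supports neither \G nor \K, so this sketch emulates \G by calling Pattern.match(text, pos), which only matches starting at pos. It is a swapped-in technique rather than the PCRE pattern above, and the names HEAD, ITEM and iter_params are made up for the example.

```python
import re

HEAD = re.compile(r"function\(")
# One parameter plus its separator: a comma, or a lookahead for the closing ")".
ITEM = re.compile(
    r"(?:/\*(?P<type>[a-zA-Z]+)\*/)?(?P<param>[0-9a-zA-Z_$]+)(?:,|(?=\)))"
)


def iter_params(text):
    """Yield (type, param) pairs, each match starting where the previous one ended."""
    head = HEAD.search(text)
    if head is None:
        return
    pos = head.end()
    while True:
        item = ITEM.match(text, pos)  # anchored at pos, playing the role of \G
        if item is None:
            break
        yield item.group("type"), item.group("param")
        pos = item.end()  # every match consumes at least one character


print(list(iter_params("var f = function(/*string*/a,b,c) { }")))
```

Because each match must start exactly where the previous one stopped, the loop ends at the closing parenthesis and never picks up identifiers from the rest of the text.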

Community
quietchaos
  • The problem is that each technology has its own niche way of handling such things. I wish there was something _within_ the set of regular expressions. Right now, anyone who is not using .Net or PCRE is left out in the cold. – Mrchief Aug 21 '11 at 02:44