2

I'm parsing HTML in a C# project.

Assuming that we have this string:

<a href="javascript:func('data1','data2'...)">...</a>

Or, after the necessary `Substring()` calls, this one:

func('data1','data2'...)

What would be the best regex pattern to retrieve the parameters of func() without relying on the delimiter characters (' and ,), since they can sometimes appear inside a parameter's string?

n1nsa1d00
  • Sometimes Regex is not the best tool for the job... specifically when what you're dealing with does not have a regular pattern. – BoltBait Sep 10 '15 at 23:14
  • @BoltBait I was actually using a `substring` and `indexOf` approach when I realized that I should have been more careful about my usage of the delimiter characters, so I thought that `Regex` would solve and simplify everything... Anyway, do you know of any other alternatives? – n1nsa1d00 Sep 10 '15 at 23:54
  • When you have a problem that you think can be solved with regular expressions, [you now have 2 problems](http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/). (Atwood quoting Zawinski) – theB Sep 11 '15 at 00:00
  • Why is my question getting down-voted? – n1nsa1d00 Sep 11 '15 at 11:09

2 Answers

5

You should not use regex to parse programming-language code, because such code is not a regular language (it allows arbitrarily nested constructs). This article explains why: Can regular expressions be used to match nested patterns?


And to prove my point, allow me to share an actual solution with a regex that I think will match what you want:

^                               # Start of string
[^()'""]+\(                     # matches `func(`
                                #
(?>                             # START - Iterator (match each parameter)
 (?(param)\s*,(?>\s*))          # if it's not the 1st parameter, start with a `,`
 (?'param'                      # opens 'param' (main group, captures each parameter)
                                #
   (?>                          # Group: matches every char in parameter
      (?'qt'['""])              #  ALTERNATIVE 1: strings (matches ""foo"",'ba\'r','g)o\'o')
      (?:                       #   match anything inside quotes
        [^\\'""]+               #    any char except quotes or escapes
        |(?!\k'qt')['""]        #    or the quotes not used here (ie ""double'quotes"")
        |\\.                    #    or any escaped char
      )*                        #   repeat: *
      \k'qt'                    #   close quotes
   |  (?'parens'\()             #  ALTERNATIVE 2: `(` open nested parens (nested func)
   |  (?'-parens'\))            #  ALTERNATIVE 3: `)` close nested parens
   |  (?'braces'\{)             #  ALTERNATIVE 4: `{` open braces
   |  (?'-braces'})             #  ALTERNATIVE 5: `}` close braces
   |  [^,(){}\\'""]             #  ALTERNATIVE 6: anything else (var, funcName, operator, etc)
   |  (?(parens),)              #  ALTERNATIVE 7: `,` a comma if inside parens
   |  (?(braces),)              #  ALTERNATIVE 8: `,` a comma if inside braces
   )*                           # Repeat: *
                                # CONDITIONS:
  (?(parens)(?!))               #  a. balanced parens
  (?(braces)(?!))               #  b. balanced braces
  (?<!\s)                       #  c. no trailing spaces
                                #
 )                              # closes 'param'
)*                              # Repeat the whole thing once for every parameter
                                #
\s*\)\s*(?:;\s*)?               # matches `)` at the end of func(), maybe with a `;`
$                               # END

One-liner:

^[^()'""]+\((?>(?(param)\s*,(?>\s*))(?'param'(?>(?'qt'['""])(?:[^\\'""]+|(?!\k'qt')['""]|\\.)*\k'qt'|(?'parens'\()|(?'-parens'\))|(?'braces'\{)|(?'-braces'})|[^,(){}\\'""]|(?(parens),)|(?(braces),))*(?(parens)(?!))(?(braces)(?!))(?<!\s)))*\s*\)\s*(?:;\s*)?$

Test online

As you can imagine by now (if you're still reading), even with an indented pattern and a comment for every construct, this regex is unreadable, quite difficult to maintain and almost impossible to debug. And I'd guess there are edge cases that would make it fail.

Just in case a stubborn mind is still interested, here's a link to the logic behind it: Matching Nested Constructs with Balancing Groups (regular-expressions.info)
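
For comparison, here is a minimal sketch of the non-regex route: a hand-written scanner that walks the argument text once, tracking the open quote and the bracket depth, and splits only on commas at depth zero. The names `ArgSplitter` and `SplitArguments` are hypothetical, and the sketch assumes the input is only the text between the outer parentheses of `func(...)`. It does not handle JavaScript comments or regex literals.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ArgSplitter
{
    // Hypothetical helper: splits the text between the outer parentheses
    // of a call into its top-level arguments.
    public static List<string> SplitArguments(string args)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        int depth = 0;       // nesting level of (), {} and []
        char quote = '\0';   // the open quote character, or '\0' outside a string

        for (int i = 0; i < args.Length; i++)
        {
            char c = args[i];

            if (quote != '\0')
            {
                current.Append(c);
                if (c == '\\' && i + 1 < args.Length)
                    current.Append(args[++i]);   // keep the escaped character verbatim
                else if (c == quote)
                    quote = '\0';                // closing quote
                continue;
            }

            if (c == '\'' || c == '"')
                quote = c;
            else if (c == '(' || c == '{' || c == '[')
                depth++;
            else if (c == ')' || c == '}' || c == ']')
                depth--;
            else if (c == ',' && depth == 0)
            {
                // A comma outside quotes and brackets ends the current argument.
                result.Add(current.ToString().Trim());
                current.Clear();
                continue;
            }
            current.Append(c);
        }

        if (quote != '\0' || depth != 0)
            throw new FormatException("Unbalanced quotes or brackets in argument list.");

        string last = current.ToString().Trim();
        if (last.Length > 0 || result.Count > 0)
            result.Add(last);
        return result;
    }
}
```

Calling `ArgSplitter.SplitArguments("'data1','data2'")` would give the two quoted strings as separate items, and a comma or parenthesis inside a quoted parameter stays part of that parameter. Unlike the pattern above, each rule is a single `if` branch that can be stepped through in a debugger.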

Mariano
-2

Try this:

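            // Requires: using System.Collections.Generic;
            //           using System.Text.RegularExpressions;
            string input = "<a href=\"javascript:func('data1','data2'...)\">...</a>";

            // Step 1: capture everything between "name(" and the first ")".
            string pattern1 = @"\w+\((?'parameters'[^\)]+)\)";

            Regex expr1 = new Regex(pattern1);
            Match match1 = expr1.Match(input);
            string parameters = match1.Groups["parameters"].Value;

            // Step 2: pull each run of word characters out of the captured text.
            string pattern2 = @"\w+";
            Regex expr2 = new Regex(pattern2);
            MatchCollection matches = expr2.Matches(parameters);

            // Collect the matches into a list of strings.
            List<string> results = new List<string>();
            foreach (Match match in matches)
            {
                results.Add(match.Value);
            }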
jdweng
  • Any example of `results` var content? Can't I have the results in an `array` or a `list`? – n1nsa1d00 Sep 11 '15 at 00:01
  • Give an example of how you want the results. Do you want the entire string, the function name, or the parameters? 'var' automatically puts results into an array/list, so your comment doesn't make a lot of sense. Using var just makes it easier for programmers to get results but, in my opinion, violates good programming practice and makes the code harder to understand. With LINQ I often try to replace var with the actual types to make the code more understandable. It doesn't change the results; LINQ still returns arrays/lists. – jdweng Sep 11 '15 at 04:54
  • Chill, brother! By 'var' I was abbreviating the word 'variable': *Any example of the `results` variable's content?* Indeed, there are no backquotes surrounding 'var'. So I think I was quite specific. – n1nsa1d00 Sep 11 '15 at 06:55
  • I modified the code to extract the parameters. It's much easier to get the parameters using two regexes than with the mess Mariano posted. – jdweng Sep 11 '15 at 09:07
  • 1
    That mess was to prove regex shouldn't be used here. I'm not the downvoter, but your solution would fail with nested functions as parameters, or with any parameter that isn't `\w+`. – Mariano Sep 12 '15 at 01:14
  • Could change to [\w\d_]. I don't remember seeing nested functions in Java. – jdweng Sep 12 '15 at 04:58
  • 1
    `[\w\d_]` matches *exactly* the same as `\w`. I believe the OP is trying to parse a JavaScript function call, that could pass another function as parameter (nesting parens). For instance, `sample = @"f(""t,x,t"",function(evt){foo(""a,b"",$c);},$bar);"`. [Test](http://rextester.com/QJWE42214) – Mariano Sep 13 '15 at 06:46
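
To make the failure mode from the comments above concrete, here is a small sketch that applies the two patterns from this answer to a nested call. The wrapper names `NestedCallDemo` and `ExtractWithTwoRegexes` are hypothetical; the patterns themselves are copied from the answer unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedCallDemo
{
    // Hypothetical wrapper around the two-regex approach from the answer.
    public static List<string> ExtractWithTwoRegexes(string input)
    {
        // Pattern 1 stops at the first ')' it meets, nested or not.
        Match call = Regex.Match(input, @"\w+\((?'parameters'[^\)]+)\)");

        // Pattern 2 keeps only runs of word characters from the captured text.
        var results = new List<string>();
        foreach (Match m in Regex.Matches(call.Groups["parameters"].Value, @"\w+"))
            results.Add(m.Value);
        return results;
    }
}
```

For the sample `@"f(""t,x,t"",function(evt){foo(""a,b"",$c);},$bar);"`, the first pattern captures only up to the `)` after `evt`, so the result is the five tokens `t`, `x`, `t`, `function`, `evt` rather than the three arguments actually passed.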