1

I have a problem that I have no idea how to solve, and I am not sure whether regular expressions are the best approach. My idea is to get the name, parameters, and body of each function in a file. This is my regular expression:

preg_match_all('/function (.*?)\((.*?)\)(.*?)\{(.*?)\}/s',$content,$funcs,PREG_SET_ORDER);  

And I have this testfile:

function testfunc($text)
{

if ($text)
{
    return 1;
}
return 0;
}

Of course this only matches up to the first "}", the one before return 0;. Is there a way to capture everything in the function, that is, to find the matching closing "}"?

anubhava
Wikunia
  • If you want to do this properly and/or extend this to something more widely usable you'd need to use or create your own sort of "function parser". – tenub Jan 21 '14 at 21:35

4 Answers

3

Contrary to popular belief, PHP (PCRE) supports recursive patterns, which let you match balanced nested brackets. Consider this code:

$str = <<<'EOF'
function testfunc($text) {
   if ($text) {
       return 1;
   }
   return 0;
}
EOF;

if ( preg_match('/ \{ ( (?: [^{}]* | (?0) )+ ) \} /x', $str, $m) )
   echo $m[0];

OUTPUT:

{
   if ($text) {
       return 1;
   }
   return 0;
}

UPDATE: To capture the function name and arguments as well, try this code:

$str = <<<'EOF'
function testfunc($text) {
   if ($text) {
       return 1;
   }
   return 0;
}
EOF;
if ( preg_match('/ (function [^{]+ ) ( \{ (?: [^{}]* | (?-1) )* \} ) /x', $str, $m) )
   print_r ($m);

OUTPUT

Array
(
    [0] => function testfunc($text) {
   if ($text) {
       return 1;
   }
   return 0;
}
    [1] => function testfunc($text) 
    [2] => {
   if ($text) {
       return 1;
   }
   return 0;
}
)

Working Online Demo: http://ideone.com/duQw9c
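
If the name, the parameter list, and the body are needed as separate values (as requested in the comments), the same recursive idea can be extended with extra capture groups. The following is only a minimal sketch: the helper name extractFunctions and the array keys are made up for illustration, and it assumes that the parameter list contains no ")" (so no array() defaults) and that no unbalanced braces occur inside strings or comments.

```php
<?php
// Hypothetical helper (not part of the answer above): returns one
// name/params/body entry per function found in $source.
function extractFunctions(string $source): array
{
    $pattern = '/
        function \s+ (\w+) \s*           # group 1: function name
        \( ( [^)]* ) \) \s*              # group 2: raw parameter list
        ( \{ (?: [^{}]++ | (?3) )* \} )  # group 3: balanced braces, recursing into itself
    /x';

    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        $result[] = [
            'name'   => $m[1],
            'params' => trim($m[2]),
            // Drop the outer braces so only the body content remains.
            'body'   => trim(substr($m[3], 1, -1)),
        ];
    }
    return $result;
}

$str = <<<'EOF'
function testfunc($text)
{
if ($text)
{
    return 1;
}
return 0;
}
EOF;

print_r(extractFunctions($str));
```

Using (?3) instead of (?0) restricts the recursion to the brace group, so the "function name(...)" prefix is matched only once per function.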

anubhava
  • I don't want to start a war with regexp fanatics, but regexp-based parsers are very fragile and terribly difficult to maintain. This regexp is anything but readable. I mean, if you're not knee-deep in regexps on a daily basis, I doubt you can understand what it does if you look at it 6 months later. Besides, each new parser feature will require more complex regexps and you will soon end up wading in cryptic code. – kuroi neko Jan 22 '14 at 00:12
  • It all depends on the requirements. If the requirements are small, a one-off job to grab this content and do other things with it, then regex might be fine. However, a full-fledged parser should be preferred for bigger jobs, though language parsers are also not that easy to integrate with. – anubhava Jan 22 '14 at 04:25
  • Agreed, but if you start with regexps you don't leave much room for improvement. I would rather use regexps for optimization than for initial development. I also agree parsers are not easy to use, but they leave more freedom to adapt and modify a design. – kuroi neko Jan 22 '14 at 04:38
  • I only want a function that gives me all functions in a file (including name, variables, and the content inside), so no real parser. Thanks for your code ;) – Wikunia Jan 23 '14 at 12:41
  • Unfortunately my modification http://regexr.com?383mi isn't working :/ Do you have any idea why (\{ ( (?: [^{}]* | (?0) )+ ) \}) and function\s(.*?)\((.*?)\)\s? are both working but together it has some problems :/ – Wikunia Jan 23 '14 at 13:28
  • I don't know if `regexr.com` can support this advanced feature. Try that in PHP code (try on ideone.com) – anubhava Jan 23 '14 at 13:30
  • I tried it with my own program as well and it's not working :/ http://ideone.com/GFBPWd – Wikunia Jan 23 '14 at 21:40
  • Yes, but I want to get "everything" and not only the content ;) So (name of function,function input-variables and content inside) In this example: ideone.com/duQw9c I need array('uniord__','$c','$h = ord($c{0});...'); – Wikunia Jan 24 '14 at 15:34
  • So you want to capture everything before first `{`? – anubhava Jan 24 '14 at 15:36
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/46024/discussion-between-wikunia-and-anubhava) – Wikunia Jan 24 '14 at 15:37
  • Check **UPDATE** section. – anubhava Jan 25 '14 at 07:03
1

Regular expressions are not the best tool for that job. Parsers are.

No doubt you can use regexp callbacks to eventually manage what you intend, but this would be ungodly obfuscated and fragile.

A parser can easily do the same job. Better still, if you are planning on parsing PHP with PHP, you can use the Zend parser that does the job for you.
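
One readily available way to lean on PHP's own lexer is the built-in tokenizer function token_get_all. The sketch below is not necessarily what is meant by "the Zend parser" above; it is an assumed, minimal illustration, and the helper name functionsViaTokenizer is hypothetical. Because braces inside strings and comments arrive as part of string or comment tokens, they do not disturb the brace counting.

```php
<?php
// Hypothetical helper: lists name and body of every named function in $code.
// $code must begin with a PHP open tag so the tokenizer treats it as PHP.
function functionsViaTokenizer(string $code): array
{
    $tokens = token_get_all($code);
    $count  = count($tokens);
    $found  = [];

    for ($i = 0; $i < $count; $i++) {
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_FUNCTION) {
            continue;
        }

        // Name: a T_STRING before the parameter list; closures have none.
        $name = null;
        for ($j = $i + 1; $j < $count && $tokens[$j] !== '('; $j++) {
            if (is_array($tokens[$j]) && $tokens[$j][0] === T_STRING) {
                $name = $tokens[$j][1];
            }
        }
        if ($name === null) {
            continue;
        }

        // Skip ahead to the opening brace; reaching ';' first means no body.
        while ($j < $count && $tokens[$j] !== '{' && $tokens[$j] !== ';') {
            $j++;
        }
        if ($j >= $count || $tokens[$j] === ';') {
            continue;
        }

        // Copy tokens until the brace depth returns to zero.
        $depth = 0;
        $body  = '';
        for (; $j < $count; $j++) {
            $tok   = $tokens[$j];
            $body .= is_array($tok) ? $tok[1] : $tok;
            // Interpolation openers inside strings are closed by a plain '}'.
            $opens = $tok === '{' || (is_array($tok)
                && in_array($tok[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true));
            if ($opens) {
                $depth++;
            } elseif ($tok === '}' && --$depth === 0) {
                break;
            }
        }
        $found[] = ['name' => $name, 'body' => $body];
    }
    return $found;
}
```

Calling functionsViaTokenizer(file_get_contents($path)) on a PHP file (with $path being whatever file you want to inspect) returns one entry per named function, including nested ones.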

kuroi neko
0

Not in general. You can of course define a regex that handles nesting two levels deep, something like function (.*)\((.*)\)(.*)\{([^}]*(\{[^}]*\})*)\}, but since such structures can be nested arbitrarily deep, you will eventually run out of regex :D. One needs a context-free grammar to do this.

You can generate parsers for such grammars with tools such as Yacc, Bison, or Gppg.

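Furthermore, note what the quantifiers mean: .* matches zero or more characters and .+ matches one or more. The ? in .*? does not make anything optional; it makes the quantifier lazy, so it matches as few characters as possible, whereas .* is greedy.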
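
As a small illustration of how the greedy and lazy forms of the quantifier behave in PCRE, here is a minimal sketch; the input string is made up for the example.

```php
<?php
$text = 'f(a) g(b)';

// Greedy: .* takes as much as possible, so the match runs to the last ")".
preg_match('/\((.*)\)/', $text, $greedy);

// Lazy: .*? takes as little as possible, so the match stops at the first ")".
preg_match('/\((.*?)\)/', $text, $lazy);

echo $greedy[1], "\n"; // a) g(b
echo $lazy[1], "\n";   // a
```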

Willem Van Onsem
0

Is there a way to get everything in the function so find the right "}".

Short answer: no.

Long answer: this cannot be handled with a single simple expression. { and } can also appear inside a method body, which makes it hard to find the correct closing }. You would need to process (iteratively or recursively) all pairs of {} and manually pick out the pairs that have a method name in front of them.

This, however, isn't simple either, because you need to exclude all statements that look like a function but are valid inside the method body.

I don't think regex is the way to go for such a task. Even if you managed to create all the required regex patterns, performance would be worse than with a dedicated parser.

dognose