Given a dummy function such as:

public function handle()
{
  if (isset($input['data'])) {
    switch($data) {
      ...
    }
  } else {
    switch($data) {
      ...
    }
  }
}

My intention is to get the contents of that function. The problem is matching the nested pairs of curly braces {...}.

I've come across recursive patterns but couldn't get my head around a regex that would match the function's body.

I've tried the following (no recursion):

$pattern = "/function\shandle\([a-zA-Z0-9_\$\s,]+\)?". // match "function handle(...)"
            '[\n\s]?[\t\s]*'. // regardless of the indentation preceding the {
            '{([^{}]*)}/'; // find everything within braces.

preg_match($pattern, $contents, $match);

That pattern doesn't match at all. I am sure the last part, '{([^{}]*)}/', is what's wrong, since the pattern works when there are no other braces within the body.

By replacing it with:

'{([^}]*)}/';

It matched up to the closing } of the switch inside the if statement and stopped there (including the } of the switch but excluding that of the if).

This pattern gave the same result:

'{(\K[^}]*(?=)})/m';
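
To make the two failure modes concrete, here is a minimal sketch that runs both non-recursive variants against a sample body. The sample source in `$contents` (including the invented `case` line) is an assumption for illustration, and the header part of the pattern is simplified to a literal `handle()`.

```php
<?php
// Hypothetical sample input; the switch body is invented for illustration.
$contents = <<<'PHP'
public function handle()
{
  if (isset($input['data'])) {
    switch ($data) {
      case 1: break;
    }
  } else {
    return;
  }
}
PHP;

// Variant 1: [^{}]* cannot cross the nested "{", so nothing matches.
$strict = preg_match('/handle\(\)\s*\{([^{}]*)\}/', $contents, $m1);

// Variant 2: [^}]* runs to the first "}" it sees, which closes the switch.
$loose = preg_match('/handle\(\)\s*\{([^}]*)\}/', $contents, $m2);

var_dump($strict);                    // int(0)
var_dump(substr_count($m2[1], '{'));  // int(2): two blocks opened, none closed
var_dump(strpos($m2[1], 'else'));     // bool(false): the else branch is never reached
```

Neither character class can tell which closing brace pairs with the first opening brace, which is why some form of depth tracking is needed.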
Asked by mulkave
  • in what universe do you need to extract a function contents with a regular expression (or by any means) –  Jun 29 '16 at 22:08
  • I would rethink what I'm doing, if I were you. The problem here is that if you have no idea what the function body can contain, the regex would need to be monstrous, unmaintainable and prone to 100000 gotchas. Imagine that you have a string containing `{` or `}` (no matching start or end brace), then your recursive pattern wouldn't work. And that's just the first situation I thought of. – M. Eriksson Jun 29 '16 at 22:20
  • Really, try to answer @Dagon's question here - what is your goal ? – Jan Jun 29 '16 at 22:30
  • This is not something regex is well suited for: I can make one, but ultimately the deeper you want it to be able to match, the longer the regex. For each level you need a separate grouping if you want to be able to match them: a regex that supports up to 4 nested `{{{{}}}}` will break on a `{{{{{}}}}}` nesting of 5. – TemporalWolf Jun 29 '16 at 22:41
  • @TemporalWolf - Don't forget that you also need to take into account, and ignore, any brace that's inside single/double quotes, heredoc etc... – M. Eriksson Jun 29 '16 at 22:46
  • @MagnusEriksson Doable, just requires even more regex. As I said, it's probably not the right tool for the job. What he's looking for is a `pushdown automaton` which a regex can't do. It can fake it on easy enough problems, but ultimately the regex would be infinitely long to cover all cases. – TemporalWolf Jun 29 '16 at 23:04
  • @TemporalWolf - I agree that regex isn't the correct tool for this. I would actually argue that you can't make it 100% safe (from bugs) with regex. Consider that you also need to ignore single quotes inside double quotes (so it doesn't think you're still in quotes when you're not etc), escaped quotes inside quotes and so on. You will most certainly always miss several different combinations. – M. Eriksson Jun 29 '16 at 23:10
  • @MagnusEriksson In the basic case however, you can if you know what your input is going to be. Either way, I've been proven wrong -> php supports recursive regexes... so it's not all that awful as revo's answer proves. – TemporalWolf Jun 29 '16 at 23:13
  • @TemporalWolf - His answer fails easily (as I commented). If you know what the input is going to be, why then even bother parsing it? Then just hard code it? – M. Eriksson Jun 29 '16 at 23:18
  • @TemporalWolf - Either way. This discussion is kinda moot since the OP hasn't really provided us with feedback to our initial questions. We're all just assuming stuff, at this point. – M. Eriksson Jun 29 '16 at 23:20
  • @MagnusEriksson Thank you for pointing out that it is not the correct approach for the solution, one of the main reasons why I've posted this question. My intention is to read the contents of the function's body and display it as it is (string), nothing beyond than that. Dagon does that answer your question? – mulkave Jun 30 '16 at 08:17
  • @TemporalWolf there is no limit to how many nested levels there could be in the body, which seconds what you're saying. I would love to know more about `pushdown automaton` and how it can solve such a problem, never heard of that term before (will sure do some research but also appreciate your bits on this). – mulkave Jun 30 '16 at 08:20
  • Possible duplicate of [How to match a method block using regex?](http://stackoverflow.com/questions/35912934/how-to-match-a-method-block-using-regex) – SamWhan Jun 30 '16 at 09:17
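
The depth-counting idea raised in the comments (what a pushdown automaton would do with its stack) can be sketched without regex by letting PHP's own tokenizer decide what is a string or a comment. This is only a sketch under stated assumptions: `extractFunctionBody` is a hypothetical helper name, the sample class is invented, and the input must start with `<?php` for `token_get_all` to treat it as code.

```php
<?php
// Hypothetical helper (not from the thread): returns the text between the
// outermost braces of the named function, or null if no body is found.
function extractFunctionBody(string $code, string $name): ?string
{
    $tokens = token_get_all($code);
    $count  = count($tokens);

    for ($i = 0; $i < $count; $i++) {
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_FUNCTION) {
            continue;
        }
        // Skip whitespace between "function" and its name.
        $j = $i + 1;
        while ($j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_WHITESPACE) {
            $j++;
        }
        if ($j >= $count || !is_array($tokens[$j])
            || $tokens[$j][0] !== T_STRING || $tokens[$j][1] !== $name) {
            continue;
        }

        $depth = 0;
        $body  = '';
        for ($k = $j + 1; $k < $count; $k++) {
            $tok  = $tokens[$k];
            $text = is_array($tok) ? $tok[1] : $tok;
            // Interpolation braces ("{$x}" and "${x}") arrive as array tokens
            // but are closed by a bare "}", so they count as openers too.
            $opens  = $tok === '{' || (is_array($tok)
                && in_array($tok[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true));
            $closes = $tok === '}';

            if ($depth === 0) {
                if ($opens) {
                    $depth = 1;          // outermost brace: not part of the body
                } elseif ($tok === ';') {
                    break;               // abstract/interface method: no body
                }
                continue;
            }
            if ($opens) {
                $depth++;
            }
            if ($closes && --$depth === 0) {
                return $body;            // matching outer brace reached
            }
            $body .= $text;
        }
    }
    return null;
}

// Invented sample: braces inside strings must not affect the depth count.
$source = <<<'PHP'
<?php
class A {
    public function handle()
    {
        $s = "}";
        if ($s) { return '{'; }
    }
}
PHP;

$body = extractFunctionBody($source, 'handle');
echo $body;
```

Because quoted strings and comments come back as single tokens, braces inside them never reach the depth counter, which is the case the regex approaches have to special-case.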

2 Answers

Update #2

Updated to cover the cases raised in the comments:

^\s*[\w\s]+\(.*\)\s*\K({((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/|#.*$|<<<\s*["']?(\w+)["']?[^;]+\3;$|[^{}<'"/#]++|[^{}]++|(?1))*)})

Note: A short RegEx, i.e. {((?>[^{}]++|(?R))*)}, is enough if you know your input does not contain { or } outside of PHP syntax (for example inside strings or comments).
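
A minimal sketch of the short pattern used from PHP. To anchor it on one specific function, this version prefixes a simplified `function handle(...)` header and recurses into group 1 with `(?1)` instead of recursing the whole pattern with `(?R)`; that adaptation, the sample source, and the variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample input; the switch body is invented for illustration.
$contents = <<<'PHP'
public function handle()
{
  if (isset($input['data'])) {
    switch ($data) {
      case 1: break;
    }
  } else {
    return;
  }
}
PHP;

// Group 1 is one balanced "{...}" block; (?1) re-enters it for each nested block.
// Group 2 is the text between the outermost braces.
$pattern = '/function\s+handle\s*\([^)]*\)\s*(\{((?>[^{}]++|(?1))*)\})/';

$body = null;
if (preg_match($pattern, $contents, $match)) {
    $body = $match[2];
    echo $body;
}
```

The nesting depth is not limited here, but any unbalanced brace inside a string or comment would still throw the match off.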

So which awkward cases does the long RegEx handle?

  1. You have [{}] in a string between quotation marks ["']
  2. You have those quotation marks escaped inside one another
  3. You have [{}] in a comment block. //... or /*...*/ or #...
  4. You have [{}] in a heredoc or nowdoc <<<STR or <<<['"]STR['"]

Beyond that, it only expects balanced pairs of opening/closing braces; the depth of nesting does not matter.

Is there a case where it fails?

Not unless you have a Martian living inside your code.

 ^ \s* [\w\s]+ \( .* \) \s* \K               # how it matches a function definition
 (                             # (1 start)
      {                                      # opening brace
      (                             # (2 start)
           (?>                               # atomic grouping (for its non-capturing purpose only)
                "(?: [^"\\]*+ | \\ . )*"     # double quoted strings
             |  '(?: [^'\\]*+ | \\ . )*'     # single quoted strings
             |  // .* $                      # a comment block starting with //
             |  /\* [\s\S]*? \*/             # a multi line comment block /*...*/
             |  \# .* $                      # a single line comment block starting with #...
             |  <<< \s* ["']?                # heredocs and nowdocs
                ( \w+ )                      # (3) ^
                ["']? [^;]+ \3 ; $           # ^
             |  [^{}<'"/#]++                 # force engine to backtrack if it encounters special characters [<'"/#] (possessive)
             |  [^{}]++                      # default matching behaviour (possessive)
             |  (?1)                         # recurse 1st capturing group
           )*                                # zero to many times of atomic group
      )                             # (2 end)
      }                                      # closing brace
 )                             # (1 end)

Formatting is done by @sln's RegexFormatter software.
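
As a rough usage sketch, the one-line form of the pattern can be passed to `preg_match` with `~` as the delimiter (the pattern itself contains `/` and `#`) and the `m` flag so that `^` and `$` work per line. The nowdoc wrapper, the sample source, and the variable names are assumptions for illustration.

```php
<?php
// The long pattern from above, unchanged, wrapped in ~...~m.
$pattern = <<<'REGEX'
~^\s*[\w\s]+\(.*\)\s*\K({((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/|#.*$|<<<\s*["']?(\w+)["']?[^;]+\3;$|[^{}<'"/#]++|[^{}]++|(?1))*)})~m
REGEX;

// Invented sample with a "}" inside a string and a "{" inside a comment.
$contents = <<<'PHP'
public function handle()
{
    $s = "}"; // stray { in a comment
    return $s;
}
PHP;

$found = preg_match($pattern, $contents, $match);
$body  = $found === 1 ? $match[2] : null; // group 2: text between the outer braces
echo $body;
```

Because of `\K`, the full match in `$match[0]` starts at the opening brace rather than at the function header.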

What did I provide in the live demo?

Laravel's Eloquent Model.php file (~3500 lines), picked at random, is used as input. Check it out: Live demo

Answered by revo
  • Fails if it is a [class](https://regex101.com/r/xE4qE3/2) or if we have a [string with a brace](https://regex101.com/r/xE4qE3/3) – M. Eriksson Jun 29 '16 at 23:14
  • Again, regex is the wrong tool and, just as @Dagon and I already said, it can't be done with regex since you won't be able to account for all situations and variations. You will basically spend the rest of your life patching the regex. Strings are just one situation; then there are comments you need to handle as well, and so on... – M. Eriksson Jun 30 '16 at 06:27
  • You shouldn't tell that to a RegEx guy since he knows it well. Regular expressions, as a *powerful text-processing* tool, are not a simple utility for cheap jobs only; their aim is to do heavy text-related things perfectly. If there were no such need, we would have stuck to UNIX wildcards and simple file patterns. By the way, I'm not patching anything: that short RegEx at first was enough for what the OP desires. But as long as someone like you tries to show me what could be wrong with it, I will do patching. @MagnusEriksson – revo Jun 30 '16 at 08:30
  • Also think about it: I never said ***Only RegEx*** and I never will. If you have a tool that does much of that heavy and nasty work in the background, then feel free to show it to the OP. And... don't up-vote each other's comments either. Confirming each other doesn't help. @MagnusEriksson – revo Jun 30 '16 at 08:31
  • @revo impressive! Thank you for that solution, it worked. – mulkave Jun 30 '16 at 09:45
  • Up voted - In our project, which uses SquishAPI (and therefore relies heavily on `"{...}"` strings to identify objects), we only have 63 occurrences in 80kLOC of `{`/`}` inside strings... I think this will work dandy in most cases. – TemporalWolf Jun 30 '16 at 16:24

This works for generating header file (.h) declarations from function definitions in a .c file.

Find Regular expression:

(void\s[^{};]*)\n^\{($[^}$]*)\}$

Replace with:

$1;

For input:

void bar(int var)
{
    foo(var);
    foo2();
}

will output:

void bar(int var);

Get the body of the function block with the second capture group:

$2

will output:

    foo(var);
    foo2();
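
The answer does not say which tool runs this find/replace. As a sketch, assuming a PCRE engine with the multiline (`m`) flag so that `^` and `$` match at line breaks, the same pattern can be applied from PHP. The variable names are illustrative, and the opening brace must be the last character on its line for the inner `$` to match.

```php
<?php
// The answer's find pattern, wrapped in ~...~ with the m flag.
$find = '~(void\s[^{};]*)\n^\{($[^}$]*)\}$~m';

$source = "void bar(int var)\n{\n    foo(var);\n    foo2();\n}\n";

// Replace the whole definition with its prototype.
$header = preg_replace($find, '$1;', $source);

// The second capture group holds the body, including its surrounding newlines.
preg_match($find, $source, $m);
$body = $m[2];

echo $header; // void bar(int var);
echo $body;
```

Since `[^}$]*` stops at the first closing brace, this only works for bodies without nested braces, which is the limitation the original question is about.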
Answered by TianaR