I am trying to parse the code inside a function body. The code sits between { and } and can span several lines.

I don't have a formal education in computer engineering and have never taken a course on compilers, so please excuse me if my question is basic or if there is no regex solution to this problem. If there is another approach, such as context-free grammars, feel free to mention it and explain it.

So, I have come up with this regular expression: /(.*\s+function\s+.*)\{(\n|.*)+\}/

The problem with this expression is that if the input contains several such functions, it captures all of them in one match. For example, the following text is captured as a single match instead of two separate matches, each containing only one code block:

private function blah() {
// CODE #1: some code that has to be captured
}

private function blah2(string $test) {
// CODE #2: some other code that has to be captured separately
}

Is it possible to use regex to capture the two blocks separately? My regex engine is PHP 7.4+.
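
The behaviour described above can be reproduced with a short sketch. It uses Python's `re` module for illustration instead of PHP's `preg_match_all`, on the assumption that greedy quantifiers behave the same way in both engines for this pattern. `SOURCE`, `GREEDY`, and `matches` are names made up for the example.

```python
import re

# Sample input modelled on the two functions in the question.
SOURCE = """private function blah() {
// CODE #1: some code that has to be captured
}

private function blah2(string $test) {
// CODE #2: some other code that has to be captured separately
}
"""

# The pattern from the question, without the PHP "/" delimiters.
GREEDY = re.compile(r"(.*\s+function\s+.*)\{(\n|.*)+\}")

# Collect the full text of every match found in the sample.
matches = [m.group(0) for m in GREEDY.finditer(SOURCE)]

print(len(matches))
for text in matches:
    print(repr(text))
```

Because `(\n|.*)+` is greedy, it runs to the end of the input and only backtracks as far as the last `}`, so both function bodies end up inside one match.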

  • You can use the `s` flag to make `.` match newline, then you don't need `\n|.` – Barmar Apr 28 '22 at 22:03
  • You need to make the quantifier non-greedy so it matches the shortest string, not the longest. – Barmar Apr 28 '22 at 22:04
  • So you want `\{.*?\}` along with the `s` flag. – Barmar Apr 28 '22 at 22:04
  • However, this won't work if there are nested braces in the function, because then you'll stop at the first `}`. Regular expressions aren't generally appropriate for matching recursive patterns like this. You should use a real parser. – Barmar Apr 28 '22 at 22:06
  • `(?<={)[^}]*` matches bits between the braces, assuming braces are balanced and non-overlapping. [Demo](https://regex101.com/r/CUtTK3/1). Braces are balanced and non-overlapping if for every '{' there is a '}' later in the string with no characters '{' or '}' in between and for every '}' there is a '{' earlier in the string with no characters '{' or '}' in between. – Cary Swoveland Apr 28 '22 at 22:10
  • Also, see [Match the body of a function using Regex](https://stackoverflow.com/questions/38110833/match-the-body-of-a-function-using-regex). – MikeM Apr 28 '22 at 22:11
  • @Barmar Thanks. If I understood you correctly, you mean something like `/(.*?\s+Mutation\s+.*?)\{(.*?)\}/s`. Right? This seems to work. How are real parsers written though? Don't they use regex? – PizzaIsLove Apr 28 '22 at 22:11
  • @CarySwoveland Thank you. I don't have enough reputation to upvote your comment but that's an interesting regular expression. Could you explain how it works and post it as an answer? – PizzaIsLove Apr 28 '22 at 22:13
  • Yes, that's correct. – Barmar Apr 28 '22 at 22:14
  • @PizzaIsLove if the question is already answered there, there's no need to post it as an answer here. – Barmar Apr 28 '22 at 22:14
  • @MikeM Thank you. That's a very useful post and it's quite relevant to what I'm looking for. – PizzaIsLove Apr 28 '22 at 22:14
  • Alas, this question has been asked many times so I expect it will be closed (duplicate of earlier question), but `[^}]*` matches zero or more characters other than a `'}'` and `(?<={)` is a *positive lookbehind* that requires that match to be immediately preceded by a `'{'`. That `'{'` is not part of the match that is returned. There are also *negative lookbehinds* and *positive* and *negative* *lookaheads*. – Cary Swoveland Apr 28 '22 at 22:19
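
The suggestions in the comments above can be compared side by side in a minimal sketch. It is written with Python's `re` module for illustration (`re.DOTALL` plays the role of the PCRE `s` flag) and uses `function` as the keyword, as in the question. The brace-counting helper `extract_bodies` is a hypothetical stand-in for the "real parser" idea, not a full PHP parser: it assumes braces never appear inside strings or comments, and it returns every top-level brace block, not only function bodies.

```python
import re

SOURCE = """private function blah() {
// CODE #1: some code that has to be captured
}

private function blah2(string $test) {
// CODE #2: some other code that has to be captured separately
}
"""

# A function whose body contains a nested brace pair.
NESTED = """private function f() {
    if ($x) {
        return 1;
    }
    return 2;
}
"""

# Lazy quantifiers plus DOTALL: each match stops at the first "}".
LAZY = re.compile(r"(.*?\s+function\s+.*?)\{(.*?)\}", re.DOTALL)
lazy_bodies = [m.group(2) for m in LAZY.finditer(SOURCE)]

# Positive lookbehind: text after a "{" up to, but excluding, the next "}".
lookbehind_bodies = re.findall(r"(?<=\{)[^}]*", SOURCE)


def extract_bodies(source):
    """Return the text of each top-level {...} block by counting brace depth.

    Assumption: no braces occur inside string literals or comments.
    """
    bodies = []
    depth = 0
    start = None
    for i, ch in enumerate(source):
        if ch == "{":
            if depth == 0:
                start = i + 1  # body begins after the opening brace
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                bodies.append(source[start:i])
    return bodies


print(len(lazy_bodies), len(lookbehind_bodies))
# With nested braces the lazy regex cuts the body short at the inner "}".
print(repr(LAZY.search(NESTED).group(2)))
# The depth counter keeps going until the matching outer "}".
print(repr(extract_bodies(NESTED)[0]))
```

On the flat sample, both regex variants separate the two bodies. On `NESTED`, the lazy pattern ends at the inner `}`, which is the limitation Barmar points out, while the depth counter returns the whole outer body.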
