
I need to parse a text file and find the PHP classes in it. For example, given a text file with this content:

... some text ...

... some other text ...

class Foo{

function Bar($param){ ... do stuff ... }

}

... some other text ...

class Bar{

function Foo(){ ... do something .... }

}

... some else ...

In this case, my regular expression must match the two classes, including their contents, and return these results:

first result:

class Foo{

function Bar($param){ ... do stuff ... }

}

second result:

class Bar{

function Foo(){ ... do something .... }

}

I've tried many times without luck. My last attempt was

/^[\n\r\t ](?:abstract|class|interface){1}(.)[^(?:class|interface)]*$/im

but it only matches

class Foo{

and

class Bar{

without the content of the class.

Thanks for your help :)

famousgarkin
santino83
  • Are you asking how to match the contents of a possibly nested `{ .. }` block structure? – tchrist Nov 11 '10 at 12:19
  • Hi and welcome to Stack Overflow. For posting code, please don't use `>` but rather paste the code as it, select it and press Ctrl-K. This is much better. – Tim Pietzcker Nov 11 '10 at 12:23

1 Answer

This cannot be done with "classic" regular expressions because you'd need to handle arbitrarily nested braces, and nested structures like these are by definition not regular. Some regex flavors (.NET, PCRE, Perl 5.6 and up) have extended regular expressions to support recursive matching, but most implementations can't handle recursion yet.

I'd also wager that even if your favorite language's regex engine can handle recursion, it's usually not the best way to go. Most of the time, you'd rather want a parser for this.
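To illustrate the parser-style alternative, here is a minimal sketch in Python (chosen only for illustration; the same loop is easy to port to PHP). It finds a class keyword at the start of a line and then counts braces until the block closes. The function name `extract_classes` and the sample text are hypothetical, and the sketch assumes that no `{` or `}` appears inside strings or comments, which a real parser would have to handle.

```python
import re

# A class-like declaration at the start of a line (keywords taken from the question).
DECL = re.compile(r"^[ \t]*(?:abstract\s+)?(?:class|interface)\b", re.I | re.M)


def extract_classes(text):
    """Return each class/interface block, from its keyword to the matching '}'.

    Assumption: braces inside strings and comments are NOT handled.
    """
    blocks = []
    pos = 0
    while True:
        m = DECL.search(text, pos)
        if m is None:
            break
        open_idx = text.find("{", m.end())
        if open_idx == -1:
            break
        depth = 0
        end = None
        # Walk forward, tracking nesting depth until the opening brace is closed.
        for i in range(open_idx, len(text)):
            ch = text[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            break  # unbalanced braces: give up rather than guess
        blocks.append(text[m.start():end].strip())
        pos = end
    return blocks


sample = """... some text ...

class Foo{

function Bar($param){ if ($param) { return 1; } }

}

... some other text ...

class Bar{

function Foo(){ return 2; }

}
"""

blocks = extract_classes(sample)
for block in blocks:
    print(block)
    print("-----")
```

Unlike an indentation-based pattern, the brace counter does not care how the code is laid out, but it is still only as reliable as its assumption about strings and comments.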

That said, even without recursive regexes you might have a chance if your code is formatted in a consistent manner (start column of the class definition == column of the closing }, no mix of tabs and spaces, and every sub-level structure is indented).

Then you could try

/^([\t ]*)(?:abstract|class|interface).*?^\1\}/sim

But this is sure to fail horribly if your code is not exactly formatted according to those rules.
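As a rough illustration, the pattern can be tried out in Python, whose `re` module supports the same backreference and flags (`/sim` corresponds to `re.S | re.I | re.M`). The sample strings below are made up for this sketch. The second one shows the failure mode: a nested `}` in the same column as the class keyword ends the match too early.

```python
import re

# The pattern from the answer, with the /sim flags mapped to Python's re flags.
PATTERN = re.compile(
    r"^([\t ]*)(?:abstract|class|interface).*?^\1\}",
    re.S | re.I | re.M,
)

# Consistently formatted input: each closing brace of a class sits in the
# same column as its "class" keyword, and nested code is indented.
well_formatted = """... some text ...

class Foo{

    function Bar($param){ return 1; }

}

... some other text ...

class Bar{

    function Foo(){ return 2; }

}
"""

matches = [m.group(0) for m in PATTERN.finditer(well_formatted)]

# Inconsistently formatted input: the method's closing brace is in column 0,
# so the lazy match stops there instead of at the end of the class.
badly_formatted = """class Baz{
function Qux(){
}
}
"""

truncated = PATTERN.search(badly_formatted).group(0)

for text in matches:
    print(text)
    print("-----")
print(repr(truncated))
```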

Explanation:

^                             # start of line
([\t\ ]*)                     # match and remember whitespace
(?:abstract|class|interface)  # match keyword
.*?                           # match as few characters as possible
^\1                           # until the next line that starts with the same amount of whitespace
\}                            # followed by a }
Tim Pietzcker
  • Tim Tim Tim, please stop saying this "cannot be done with regexes" stuff. It's [not true](http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386). – tchrist Nov 11 '10 at 12:27
  • @tchrist: OK, I have clarified my answer. A little :). I still don't think it's a good thing to use recursion in regular expressions even if some modern dialects can. Regexes are hard enough already... – Tim Pietzcker Nov 11 '10 at 12:50
  • Not perl6. perl5 has had it since at least 5.6 from back last millennium. The cooler buffer recursion thing though is from 5.10 and about three years old. – tchrist Nov 11 '10 at 12:51
  • @TimPietzcker: It depends on what you’re doing. I think a regex can be very maintainable, moreso than a dedicated parser. You just have to use "grammatical" regexes, like [here](http://stackoverflow.com/questions/764247/why-are-regular-expressions-so-controversial/4053506#4053506) and [here](http://stackoverflow.com/questions/4044946/regex-to-split-html-tags/4045840#4045840). – tchrist Nov 11 '10 at 12:55
  • @tchrist: How about handling `}` s inside comments or strings? Is it feasible to write a regex that finds the correct matching brace for `{ foo { bar "baz{" /* {{comment} */ tutu } tata }`? – Tim Pietzcker Nov 11 '10 at 14:22