
I'm having trouble finding a regex that matches the start and end characters of a PHP class, which are { and } respectively. The regex should not match a { or } that is inside PHP comments; in other words, it should not match if the { or } is preceded by any character other than whitespace.

I suppose I should use a negative lookbehind, but I'm a little rusty on regex, and so far I haven't found the solution.

Here is my test string:

<?php


namespace Ling\Light_TaskScheduler\Service;


/**
 * The LightTaskSchedulerService class. :{
 */
class LightTaskSchedulerService
{

    /**
     *
     * This method IS the task manager.
     * See the @page(Light_TaskScheduler conception notes) for more details.
     *
     */
    public function run()
    {
        $executionMode = $this->options['executionMode'] ?? "lastOnly";
        $this->logDebug("Executing run method with execution mode \"$executionMode\".");


    }


}


// this can happen in comments: }, why
// more stuff







And my pattern, which doesn't work at the moment, is this:

    if(preg_match('!^\s*\{\s*(.*)(?<![^\s]*)\}!ms', $c, $match)){
        a($match);
    }

So, I used the multiline modifier "m", since we need to parse a multiline string, and the "s" modifier so that the dot matches line breaks, but the negative lookbehind part (?<![^\s]*) doesn't seem to work. I'm basically trying to say: don't match the "}" char if it's preceded by anything but whitespace.

@Wiktor Stribiżew: I tried this pattern but it still doesn't work: !^\s*\{\s*(.*)(?<!\S)\}!ms

Considering Tim Biegeleisen's comment, I'll probably take a simpler approach, like removing the comments first, and then do the simpler regex !^\s*\{\s*(.*)\}!ms, which I know will work.

However, if somebody knows a regex that does it, I would be interested in seeing it.

Problem solved for now, I'm out, thanks guys.

@Wiktor Stribiżew

The weird thing is that your regex works on the regex101 website, but it doesn't work in my version of php (PHP 7.2.31).

So I mean: this doesn't work in my php world:

$c = <<<'EEE'
<?php

/**
 * The LightTaskSchedulerService class. :{
 */
class LightTaskSchedulerService
{

    /**
     *
     * This method IS the task manager.
     * See the @page(Light_TaskScheduler conception notes) for more details.
     *
     */
    public function run()
    {
        $executionMode = $this->options['executionMode'] ?? "lastOnly";
        $this->logDebug("Executing run method with execution mode \"$executionMode\".");


    }


}


// this can happen in comments: }, why
// more stuff


EEE;



if(preg_match('/^\s*\{\s*(.*)(?<!\S)\}$/gms', $c, $match)){
    echo "a match was found"; // is never displayed
}
exit;

So I don't know what regex101 is using under the hood, but it doesn't work for me.

UPDATE

As Tim suggested, regex might not be the most appropriate tool for this job.

I ended up using a very simple solution to find the end character, and something similar can be applied to find the start character:

    /**
     * Returns an array containing information related to the end of the class.
     *
     * Important note, this method assumes that:
     *
     * - the parsed php file contains valid php code
     * - the parsed php file contains only one class
     *
     * If either of the above assumptions is not true, then this method won't work properly.
     *
     *
     *
     * The returned array has the following structure:
     *
     *
     * - endLine: int, the number of the line containing the class declaration's last char
     * - lastLineContent: string, the content of the last line being part of the class declaration
     *
     *
     * @return array
     */
    public function getClassLastLineInfo(): array
    {

        $lastLineNumber = null;
        $lastLineContent = null;


        $lines = file($this->file);
        $reversedLines = array_reverse($lines);
        foreach ($reversedLines as $k => $line) {
            if ('}' === trim($line)) {
                $n = count($lines);
                $lastLineNumber = $n - $k;
                $lastLineContent = $line;
                break;
            }
        }

        return [
            "endLine" => $lastLineNumber,
            "lastLineContent" => $lastLineContent,
        ];
    }

With something similar for the start char, we can obtain the line numbers of the start and end characters of the class. Armed with those, we can get all the lines of the file as an array and use a combination of array_slice/implode to "recompile" the content of the class.
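The array_slice/implode step could look roughly like the sketch below. The helper name `sliceLines` and the sample lines are made up for illustration; it assumes 1-based, inclusive line numbers like the `endLine` returned above, and lines that still carry their trailing newline, as `file()` returns them by default.

```php
<?php
/**
 * Hypothetical helper: joins the lines from $startLine to $endLine
 * (1-based, inclusive) back into a single string.
 */
function sliceLines(array $lines, int $startLine, int $endLine): string
{
    // array_slice() expects a 0-based offset and a length.
    $length = $endLine - $startLine + 1;

    // Each line still ends with its newline, so an empty glue restores the text.
    return implode('', array_slice($lines, $startLine - 1, $length));
}

// Made-up sample, shaped like the result of file().
$lines = ["<?php\n", "class A\n", "{\n", "    public \$x;\n", "}\n", "// comment: }\n"];

// Lines 3 to 5 hold the class body, from its opening { to its closing }.
echo sliceLines($lines, 3, 5);
```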

Anyway, thanks for the comments.

ling
  • `(?<![^\s]*)` never works as its pattern can match an empty string. Did you mean `(?<!\S)`? See https://regex101.com/r/KLGeOS/1 – Wiktor Stribiżew Jul 23 '20 at 08:23
  • Regex alone is probably _not_ an appropriate tool for what you are trying to do. Consider how your IDE identifies the body of a class: it _parses_ the entire class, counting `{` and `}` to find the start and end. Regex alone can't do this. – Tim Biegeleisen Jul 23 '20 at 08:28
  • What does "I tried this pattern but it still doesn't work" mean? How *does* it work and how is it supposed to work? – Wiktor Stribiżew Jul 23 '20 at 08:37
  • Oops sorry. Wiktor, I didn't browse your link, because I tested the pattern in my code directly, and it didn't work, meaning with my test string, and in php, using the preg_match function as shown in my post, it didn't match. But now I see that in your link it finds the match. So I don't know. I tried to add the g flag in my code, still didn't work. I will try again, because if it works in regex101, I reckon it's supposed to work in my code also, I'll keep trying on that, thanks... – ling Jul 23 '20 at 08:47
  • Well, I don't know, it works on the regex101 website, but not in my code. I'm using PHP 7.2.31, maybe there is a regex flavor difference too, I don't know, but thanks for the regex anyway. – ling Jul 23 '20 at 08:57
  • @F. Müller, doesn't work either, but thanks. – ling Jul 23 '20 at 08:59
  • The answers of [this question](https://stackoverflow.com/q/6751105/4265352) explain why it is not possible to parse HTML or XML using regular expressions. PHP code is not very different in this regard. Regular expressions are very good at recognizing the lexical tokens of a language (keywords, numbers, strings, identifiers, operators etc) but this is where the power of regex ends. A more powerful tool is needed to put the tokens together and recognize syntactic constructions (`if/then/else` blocks, loops, class declarations, function calls, expressions etc). – axiac Jul 23 '20 at 12:17
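The parsing approach suggested in the comments above (counting `{` and `}` on lexical tokens instead of raw characters) can be sketched with PHP's built-in tokenizer, `token_get_all()`. This is only a minimal sketch: the helper name `findClassBraces` and the sample code are made up, and it assumes valid PHP in which the first `class` keyword is the class declaration. Braces inside comments and plain strings arrive as part of larger tokens, so they are never counted.

```php
<?php
/**
 * Hypothetical helper: returns the byte offsets of the opening and closing
 * brace of the first class in $code, or null if no class body is found.
 */
function findClassBraces(string $code): ?array
{
    $offset = 0;        // byte offset of the current token within $code
    $depth = 0;         // brace nesting depth once the class keyword was seen
    $start = null;      // offset of the class's opening brace
    $seenClass = false;

    foreach (token_get_all($code) as $token) {
        // Tokens are either [id, text, line] arrays or single-character strings.
        $id = is_array($token) ? $token[0] : null;
        $text = is_array($token) ? $token[1] : $token;

        if ($id === T_CLASS) {
            $seenClass = true;
        } elseif ($seenClass) {
            // "{$x}" and "${x}" inside strings open with their own token ids
            // but close with a plain "}", so they count as openers too.
            if (($id === null && $text === '{')
                || $id === T_CURLY_OPEN
                || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
                if ($depth === 0) {
                    $start = $offset;
                }
                $depth++;
            } elseif ($id === null && $text === '}') {
                $depth--;
                if ($depth === 0) {
                    return ['start' => $start, 'end' => $offset];
                }
            }
        }
        $offset += strlen($text);
    }

    return null;
}

// Made-up sample with braces in a doc comment, a string and line comments.
$code = <<<'CODE'
<?php
/** Doc comment with a brace: { */
class A
{
    public function run($x)
    {
        if ($x) { echo "{$x}"; }
        // }
    }
}
// trailing comment: }
CODE;

$braces = findClassBraces($code);
// Prints the class body, from its opening { to its closing }.
echo substr($code, $braces['start'], $braces['end'] - $braces['start'] + 1);
```

Working with byte offsets rather than line numbers keeps the sketch short; the line number of either brace can be derived afterwards by counting newlines before the offset.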

1 Answer

UPDATE

As people have already stated in the comment section: Regex might not be the best solution to do this. Anyway, you asked for it and I tested it with the class below.

// 1) without class check -> this does not work if there is code on the same line as the opening {
preg_match('/(?:^{(?!\r?\n?\s*\*\/)|{\s*$(?!\r?\n?\s*\*\/)).+^\s*}(?!\r?\n?\s*\*\/)/ms', $c, $match);

// 2) with class check -> this should always work
preg_match('/^[\s\w]+?(?:{(?!\r?\n?\s*\*\/)|{\s*$(?!\r?\n?\s*\*\/)).+^\s*}(?!\r?\n?\s*\*\/)/ms', $c, $match);

// 3) with class check and capturing the second part (non-class-definition) separately -> this should always work
preg_match('/^[\s\w]+?((?:{(?!\r?\n?\s*\*\/)|{\s*$(?!\r?\n?\s*\*\/)).+^\s*}(?!\r?\n?\s*\*\/))/ms', $c, $match);

I recommend using 3).

/**
 * The LightTaskSchedulerService class. :{
 */
class LightTaskSchedulerService implements TaskSchedulerService {
{
    /**
     *
     * This method IS the task manager.
     * See the @page(Light_TaskScheduler conception notes) for more details.
     *
     */
    public function run()
    {
        $executionMode = $this->options['executionMode'] ?? "lastOnly";
        $this->logDebug("Executing run method with execution mode \"$executionMode\".");
        if ($foo) {
            doBar($foo);
        }
        /* multiline */
        // simple one line comment
        // simple one line comment { }
        # another comment
        # another comment}} {
        # another comment{/*}*/
//}
#}
/*}*/
/*{*/
/*
}*/
/*
}
*/
    }
}


// this can happen in comments:}, why
// more stuff
/* multiline hello} hello{
}*/
# singleline{
#}
//}
/*}*/
/**
}*/

Output:

Array
(
    [0] => {
{
    /**
     *
     * This method IS the task manager.
     * See the @page(Light_TaskScheduler conception notes) for more details.
     *
     */
    public function run()
    {
        $executionMode = $this->options['executionMode'] ?? "lastOnly";
        $this->logDebug("Executing run method with execution mode \"$executionMode\".");
        if ($foo) {
            doBar($foo);
        }
        /* multiline */
        // simple one line comment
        // simple one line comment { }
        # another comment
        # another comment}} {
        # another comment{/*}*/
//}
#}
/*}*/
/*{*/
/*
}*/
/*
}
*/
    }
}
)
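As a smaller check, here is a sketch that runs pattern 3) on a made-up class whose opening brace sits on the same line as the class declaration. The sample string is hypothetical and only covers a few comment styles, so it illustrates how to read `$match[1]` and is not proof that the pattern handles every class.

```php
<?php
// Pattern 3) from above, unchanged.
$pattern = '/^[\s\w]+?((?:{(?!\r?\n?\s*\*\/)|{\s*$(?!\r?\n?\s*\*\/)).+^\s*}(?!\r?\n?\s*\*\/))/ms';

// Hypothetical sample: brace on the class line, braces inside comments.
$c = <<<'EEE'
<?php
/**
 * Doc. :{
 */
class A extends B {
    function f()
    {
    }
}
// comment: }
/*
}
*/
EEE;

if (preg_match($pattern, $c, $match)) {
    // Group 1 holds the class body, from its opening { to its closing }.
    echo $match[1];
}
```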

Your code does not work, because it has errors:

  1. Unknown modifier g: preg_match does not support it => use preg_match_all instead
  2. $c in your code does not work, since it is not in the PHP scope. Write <?php $c = <<<'EEE' ... instead
  3. The lookbehind in your case did not work, since a lookbehind cannot contain unbounded quantifiers such as * or +.
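Errors 1 and 3 can be reproduced with a short script. This is a sketch on a made-up input string: the `@` operator only hides the compile warning so that the return values can be compared, and `false` here means that the pattern failed to compile, not that nothing matched.

```php
<?php
// Hypothetical sample: the class's closing brace sits alone on its line.
$c = "class A\n{\n}\n// comment: }, why\n";

// 1. 'g' is not a pattern modifier in PHP, so compilation fails
//    and preg_match() returns false.
$withG = @preg_match('/^\s*\{\s*(.*)(?<!\S)\}$/gms', $c);

// The same pattern without 'g' compiles and finds the class braces.
$withoutG = preg_match('/^\s*\{\s*(.*)(?<!\S)\}$/ms', $c, $match);

// 3. The * inside the lookbehind makes its length unbounded,
//    so this pattern is rejected at compile time as well.
$badLookbehind = @preg_match('/(?<![^\s]*)\}/', $c);

var_dump($withG, $withoutG, $badLookbehind);
```

Dropping the `@` while debugging shows the actual warning text, which is usually the quickest way to tell a compile error apart from a pattern that simply does not match.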

References:

On php.net 'g' is not listed as an option.
Modifier 'g': preg_match_all

I don't think you even need preg_match_all; a simple preg_match should work, since you only need this one match anyway.

This should work (tested with PHP 7.0.1). It does for me:

preg_match('/^class\s+\w+\s*({.+(?<! )})/ms', $c, $match);
// or:
preg_match('/^class[^{]+({.+(?<! )})/ms', $c, $match);
// or even:
preg_match('/^{.+\r?\n}(?<! )/ms', $c, $match);

print_r($match);

The negative lookbehind in my regex rejects a } that is preceded by a space, so the closing bracket has to sit at the very left of its line. This will work unless you format your code differently; you need some kind of delimiter anyway. It also prevents the closing curly bracket of an if-statement inside your run() method from ending the search.

print_r output of $match for the first preg_match statement above:

Array
(
    [0] => class LightTaskSchedulerService
{

    /**
     *
     * This method IS the task manager.
     * See the @page(Light_TaskScheduler conception notes) for more details.
     *
     */
    public function run()
    {
        $executionMode = $this->options['executionMode'] ?? "lastOnly";
        $this->logDebug("Executing run method with execution mode \"$executionMode\".");
        if ($foo) {
            doBar($foo);
        }
    }
}
    [1] => {

    /**
     *
     * This method IS the task manager.
     * See the @page(Light_TaskScheduler conception notes) for more details.
     *
     */
    public function run()
    {
        $executionMode = $this->options['executionMode'] ?? "lastOnly";
        $this->logDebug("Executing run method with execution mode \"$executionMode\".");
        if ($foo) {
            doBar($foo);
        }
    }
}
)
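The `(?<! )` lookbehind only rejects a `}` that directly follows a space, so a brace glued to other text in a comment still ends the match. A minimal sketch on two made-up inputs that differ by that one space:

```php
<?php
// Second pattern from above, unchanged.
$pattern = '/^class[^{]+({.+(?<! )})/ms';

// Hypothetical inputs: same class, different spacing in the trailing comment.
$class = "class A\n{\n    function f()\n    {\n    }\n}\n";
$spaced = $class . "// comment: }, why\n";
$unspaced = $class . "// comment:}, why\n";

preg_match($pattern, $spaced, $m1);
preg_match($pattern, $unspaced, $m2);

// With the space, group 1 stops at the class's closing brace.
var_dump($m1[1] === "{\n    function f()\n    {\n    }\n}");

// Without the space, the greedy .+ runs on into the comment.
var_dump(strpos($m2[1], '// comment:}') !== false);
```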
F. Müller
  • Unfortunately this doesn't work either, since you still have the "// this can happen in comments: }" string in your match, and the regex I'm looking for is one that exclude those comments from the matches. – ling Jul 23 '20 at 10:01
  • @ling This is the regex that you provided. Use mine instead and it does not do that. I will edit it. Just a sec. – F. Müller Jul 23 '20 at 10:01
  • @ling I have updated my solution. Also, I took care of other curly brackets inside the code so it won't break the search. – F. Müller Jul 23 '20 at 10:51
  • thanks for the update, however I cannot accept this as the answer since it works only for the specific string I provided, but it doesn't for all the php classes (which is what my question was about). I used your third regex (./^{.+(?<! )}/ms ), since the first one didn't take into account that a class can extend another one. I found a problem in that third regex: If I replace the "// this can happen in comments: }" string with "// this can happen in comments:}", then it will match the comment again. – ling Jul 23 '20 at 11:39
  • @ling As I stated... you need a delimiter somehow. Why would you put brackets in the comments anyway? Well, I can have a look at the other things. Besides your question could be more on point. It is not 100% clear what is required tbh. Can you update your question to make it more clear? – F. Müller Jul 23 '20 at 11:44
  • @ F.Müller: I've updated my answer, and I ended up not using regex, so to me this problem is over. However for the sake of finding the regex, well that's the whole point that in the comments you can put any string you like, otherwise this regex would be too easy, wouldn't it. I'm still going to accept any answer that finds that regex (if it exists), but it has to exclude all possible comments. But as I said, I've already moved on, so don't worry too much about it, unless it's for your own curiosity. Cheers. – ling Jul 23 '20 at 12:02
  • @ F.Müller: I've tried your new regex, seems good with comment, however it breaks if with the first line you do this "class LightTaskSchedulerService {": the opening bracket on the same line as the class declaration. – ling Jul 27 '20 at 05:20
  • @ F. Müller, you probably didn't test your last regex, because I don't have the output you show. From my test your regex matches the very first bracket in the top comment (which it shouldn't). Please don't hurt yourself too much with this, as regex is probably not the best tool to solve this. – ling Jul 27 '20 at 08:52
  • @ling It was a copy & paste mistake. Sorry for that. Anyway, this is my last attempt I guess. There should be only one restriction to the regex now - you can't put code on the same line as the opening { from the class. However, if you include the class part it might just work for any case. I added both ways. I think the 2nd one might be better. – F. Müller Jul 27 '20 at 10:55
  • Well, given the restriction, it seems to me that it works then (just tried the first regex). I will accept your answer. Thanks for your time. – ling Jul 27 '20 at 11:33
  • @ling Thanks. You may want to try solution 3), which should work for all cases. The only difference is that the result you want is $match[1] instead of $match[0], but it seems to work even without the restriction. – F. Müller Jul 27 '20 at 11:46