preg_match negative lookbehind

Question

I'm trying to parse PHPDoc tags with preg_match, but I'm having an issue with negative lookbehind. I've never used lookbehinds before, but it is my understanding that they're used as exclusions.

Here is my pattern:

/\*\*.+?(?<! \*/)@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(

Here is my sample PHP file I'm trying to parse:

<?php

/**
 * This is the shortcut to DIRECTORY_SEPARATOR
 */
defined('DS') or define('DS',DIRECTORY_SEPARATOR);

/**
 * Foo
 * 
 * @return bool
 * @access public
 */
function foo()
{
    return true;
}

I want to match any function with an @access public tag, but in this case the match starts at the DS constant's comment. I thought the (?<! \*/) would stop the match from running past the closing comment tag of the DS comment.

What am I missing?

Out of curiosity, would it be more robust to use the [PHP Tokenizer](http://www.php.net/manual/en/book.tokenizer.php) to get the doc comments, then just regex the contents of the comment? — bishop, Jan 29 '14 at 01:25
Shorter, probably not, but I think more bullet proof and maintainable. See @CasimiretHippolyte answer for a great example of what I was talking about. Of course, if you have something that works now... well, "if it ain't broke, don't fix it..." — bishop, Jan 29 '14 at 12:06
@bishop, I looked more into token_get_all(), but once you boil it down to looping through the tokens looking for T_DOC_COMMENT, followed by if statements for T_PUBLIC, T_STATIC, T_FUNCTION, function name, T_VARIABLE, checking for T_WHITESPACE etc, you're doing EXACTLY what the regex is doing. So I wouldn't call it more robust, easier, or shorter. In the end, the pattern I have (below) works great and spit out a nice array of exactly what I need. — Sarke, Jan 30 '14 at 00:52
Yep, but I believe the tokenized approach to be more maintainable. Those regex are monsters! However, what works works, period. Glad you got the solution you need. — bishop, Jan 30 '14 at 01:23

score 3 · Accepted Answer · edited May 23 '17 at 12:03

Following the link by @bishop, I found an example using negative lookahead that works for me.

I changed

.+?(?<! \*/)

to

(?:(?! \*/).)+?

So the full pattern is now:

/\*\*(?:(?! \*/).)+?@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(
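
To see the difference between the two patterns, here is a minimal sketch that runs both against the sample file from the question. The `~` delimiters and the `s` (dotall) modifier are assumptions, since the patterns are shown without them; the variable names are only for illustration.

```php
<?php
// Sample file from the question, kept in a nowdoc so nothing is interpolated.
$source = <<<'SRC'
<?php

/**
 * This is the shortcut to DIRECTORY_SEPARATOR
 */
defined('DS') or define('DS',DIRECTORY_SEPARATOR);

/**
 * Foo
 *
 * @return bool
 * @access public
 */
function foo()
{
    return true;
}
SRC;

// Assumed: "~" as delimiter and the "s" modifier so "." also matches newlines.
$lookbehind = '~/\*\*.+?(?<! \*/)@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(~s';
$tempered   = '~/\*\*(?:(?! \*/).)+?@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(~s';

preg_match($lookbehind, $source, $before);
preg_match($tempered, $source, $after);

// Lookbehind version: the match starts at the DS comment and runs through it.
var_dump(strpos($before[0], 'DIRECTORY_SEPARATOR') !== false); // bool(true)

// Lookahead version: each consumed character is checked, so the match cannot
// cross a " */" and starts at the doc comment of foo() instead.
var_dump(strpos($after[0], 'DIRECTORY_SEPARATOR') !== false);  // bool(false)
```

The lookahead is tested before every single character that the group consumes, which is why it can act as a boundary. The lookbehind in the original pattern is tested only once, at the position right before `@access`.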

EDIT:

Full pattern that also matches function types and parameters:

(?<full>[\t ]*?/\*\*(?:(?! \*/).)+?@access public(?:(?! \*/).)+? \*/\s+?(?:public |protected |private )??(?:static )??function\s+[a-zA-Z0-9_]+\(.*?\))

And class matching:

(?<full>(?<indent>[\t ]*?)/\*\*(?:(?! \*/).)+?@access public.+? \*/\s+?(?:abstract )??class\s+[a-zA-Z0-9_]+\s??.*?{)
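
A short sketch of how the function pattern above might be used with preg_match_all. The sample class, the `~` delimiters and the `s` modifier are assumptions added for illustration.

```php
<?php
// Hypothetical input: one method tagged "@access public", one tagged private.
$source = <<<'SRC'
<?php
class A
{
    /**
     * Bar
     *
     * @access public
     */
    public static function bar($x, $y = 1)
    {
    }

    /**
     * Baz
     *
     * @access private
     */
    private function baz()
    {
    }
}
SRC;

// Function pattern from the answer, wrapped in assumed "~" delimiters with "s".
$pattern = '~(?<full>[\t ]*?/\*\*(?:(?! \*/).)+?@access public(?:(?! \*/).)+? \*/'
         . '\s+?(?:public |protected |private )??(?:static )??function\s+[a-zA-Z0-9_]+\(.*?\))~s';

preg_match_all($pattern, $source, $matches);

// Each entry holds the doc comment, the modifiers, the name and the parameters.
foreach ($matches['full'] as $block) {
    echo $block, "\n---\n";
}
```

Only the block for bar() is captured here. Note that the trailing `\(.*?\)` stops at the first closing parenthesis, so a default value such as `array()` in the parameter list would cut the capture short.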

score 0 · Answer 2 · answered Jan 29 '14 at 01:26


A negative lookbehind must be of fixed length. It sounds like you would be better served using some sort of DocBlock parser. There are numerous solutions available.

Mike Brant


[Sometimes you can work around that limitation](http://stackoverflow.com/questions/11640447/regexps-variable-length-lookbehind-assertion-alternatives), though. – bishop Jan 29 '14 at 01:27
@bishop My favorite is reversing the string and using a look behind. +1 – Ohgodwhy Jan 29 '14 at 01:28
@Ohgodwhy: (: !oot ineM – bishop Jan 29 '14 at 01:29
Note that his lookbehind has a fixed length. – Casimir et Hippolyte Jan 29 '14 at 01:49
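
As the last comment points out, the lookbehind in the question is fixed length (three characters), so it compiles fine. What it does is test only the three characters immediately before `@access`. A minimal sketch with made-up input strings, assuming `~` as the delimiter:

```php
<?php
// The lookbehind is evaluated once, at the position right before "@access".
$pattern = '~(?<! \*/)@access~';

// The three characters before "@access" are " * ", so the assertion passes,
// even though a " */" occurs earlier in the string.
$passes = preg_match($pattern, '/** one */ /** * @access public */');

// Here the three characters before "@access" are exactly " */", so it fails.
$fails = preg_match($pattern, ' */@access');

var_dump($passes, $fails); // int(1), int(0)
```

This is why the assertion in the question cannot prevent `.+?` from running across the closing tag of an earlier comment.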

Casimir et Hippolyte · Answer 3 · 2014-01-29T04:20:36.657

With the token_get_all() function:

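$tokens = token_get_all($code);
$result = array();
$isPublic = false;   // last doc comment seen contains "@access public"
$isFunction = false; // previous significant token was T_FUNCTION

foreach ($tokens as $token) {
    switch ($token[0]):
        case T_DOC_COMMENT:
            // strpos() returns an offset or false, so compare strictly
            $isPublic = strpos($token[1], '@access public') !== false;
            break;

        case T_FUNCTION:
            $isFunction = true;
            break;

        case T_WHITESPACE:
            break;

        case T_STRING:
            // a T_STRING right after "function" is the function name
            if ($isFunction && $isPublic) $result[] = $token[1];
            // intentional fall-through to reset $isFunction

        default:
            $isFunction = false;
    endswitch;
}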

print_r($result);

To get an idea of what you can extract with the tokenizer, I suggest putting the following code in the foreach loop, under the endswitch;:

if ($isPublic && isset($token[1]))
    printf("%s\t%s\t%s\n", $token[0],
                           token_name($token[0]),
                           strtr($token[1], "\n", ' ')
                           );
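
Building on these elements, here is a rough sketch of how the doc comment, the modifiers and the parameter list could be collected as well. The function name `findPublicFunctions`, the shape of the returned array and the sample input are all made up for illustration, and the code assumes the input starts with an opening `<?php` tag so that the tokenizer treats it as PHP.

```php
<?php
// Hypothetical helper: returns one entry per named function whose directly
// preceding doc comment contains "@access public".
function findPublicFunctions(string $code): array
{
    $tokens = token_get_all($code);
    $count = count($tokens);
    $modifierIds = [T_PUBLIC, T_PROTECTED, T_PRIVATE, T_STATIC, T_ABSTRACT, T_FINAL];
    $found = [];
    $doc = null;      // doc comment waiting for its function
    $modifiers = [];  // modifiers seen between the comment and "function"

    for ($i = 0; $i < $count; $i++) {
        $token = $tokens[$i];
        $id = is_array($token) ? $token[0] : null;

        if ($id === T_WHITESPACE) continue;
        if ($id === T_DOC_COMMENT) { $doc = $token[1]; $modifiers = []; continue; }
        if ($doc !== null && in_array($id, $modifierIds, true)) {
            $modifiers[] = $token[1];
            continue;
        }
        if ($id === T_FUNCTION && $doc !== null && strpos($doc, '@access public') !== false) {
            $name = null;
            $params = '';
            $depth = 0;
            // Read the name, then everything up to the matching ")".
            for ($j = $i + 1; $j < $count; $j++) {
                $isArr = is_array($tokens[$j]);
                $text = $isArr ? $tokens[$j][1] : $tokens[$j];
                if ($depth === 0 && $name === null && $isArr && $tokens[$j][0] === T_STRING) {
                    $name = $text;
                } elseif ($text === '(') {
                    if ($depth++ > 0) $params .= $text;
                } elseif ($text === ')') {
                    if (--$depth === 0) break;
                    $params .= $text;
                } elseif ($depth > 0) {
                    $params .= $text;
                }
            }
            if ($name !== null) { // closures have no name and are skipped
                $found[] = ['doc' => $doc, 'modifiers' => $modifiers,
                            'name' => $name, 'params' => $params];
            }
            $i = $j;
        }
        // Any other token breaks the link between a comment and a function.
        $doc = null;
        $modifiers = [];
    }
    return $found;
}

// Usage with a made-up input.
$code = <<<'SRC'
<?php
/**
 * This is the shortcut to DIRECTORY_SEPARATOR
 */
defined('DS') or define('DS', DIRECTORY_SEPARATOR);

class A
{
    /**
     * @access public
     */
    public static function bar($x, $y = array())
    {
    }
}
SRC;

print_r(findPublicFunctions($code));
```

Because parentheses are counted, a default value such as `array()` stays inside the captured parameter string. Attributes, by-reference edge cases and similar details are not handled in this sketch.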

Thank you, but I need to capture the comment, function type (e.g. public static) and function parameters as well. — Sarke, Jan 29 '14 at 03:55
@Sarke: see my edit, I think that with these elements you will be able to find what you want. — Casimir et Hippolyte, Jan 29 '14 at 04:21
