I'm trying to parse PHPDoc tags with preg_match, but I'm having an issue with negative lookbehind. I've never used lookbehinds before, but it is my understanding that they're used as exclusions.

Here is my pattern:

/\*\*.+?(?<! \*/)@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(

Here is my sample PHP file I'm trying to parse:

<?php

/**
 * This is the shortcut to DIRECTORY_SEPARATOR
 */
defined('DS') or define('DS',DIRECTORY_SEPARATOR);

/**
 * Foo
 * 
 * @return bool
 * @access public
 */
function foo()
{
    return true;
}

I want to match any function with an @access public tag, but in this case the match starts at the DS constant's comment. I thought the (?<! \*/) would stop the match from running past the closing */ of the DS comment.

What am I missing?

Sarke
  • Out of curiosity, would it be more robust to use the [PHP Tokenizer](http://www.php.net/manual/en/book.tokenizer.php) to get the doc comments, then just regex the contents of the comment? – bishop Jan 29 '14 at 01:25
  • @bishop easier, shorter code? – Sarke Jan 29 '14 at 03:57
  • Shorter, probably not, but I think more bulletproof and maintainable. See @CasimiretHippolyte's answer for a great example of what I was talking about. Of course, if you have something that works now... well, "if it ain't broke, don't fix it..." – bishop Jan 29 '14 at 12:06
  • @bishop, I looked more into token_get_all(), but once you boil it down to looping through the tokens looking for T_DOC_COMMENT, followed by if statements for T_PUBLIC, T_STATIC, T_FUNCTION, function name, T_VARIABLE, checking for T_WHITESPACE etc., you're doing EXACTLY what the regex is doing. So I wouldn't call it more robust, easier, or shorter. In the end, the pattern I have (below) works great and spits out a nice array of exactly what I need. – Sarke Jan 30 '14 at 00:52
  • Yep, but I believe the tokenized approach to be more maintainable. Those regex are monsters! However, what works works, period. Glad you got the solution you need. – bishop Jan 30 '14 at 01:23

3 Answers

Following the link by @bishop, I found an example using negative lookahead that works for me.

I changed

.+?(?<! \*/)

to

(?:(?! \*/).)+?

So the full pattern is now:

/\*\*(?:(?! \*/).)+?@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(
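To see why this change matters, here is a minimal sketch that runs both patterns against the sample file from the question. The `~` delimiters and the `s` modifier are assumptions (the patterns above are shown without delimiters or flags), and the variable names are only for illustration.

```php
<?php
// Sample source from the question (trailing spaces trimmed).
$code = <<<'SRC'
<?php

/**
 * This is the shortcut to DIRECTORY_SEPARATOR
 */
defined('DS') or define('DS',DIRECTORY_SEPARATOR);

/**
 * Foo
 *
 * @return bool
 * @access public
 */
function foo()
{
    return true;
}
SRC;

// Assumed wrapping: "~" delimiters, and "s" so that "." also matches newlines.
$original = '~/\*\*.+?(?<! \*/)@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(~s';
$fixed    = '~/\*\*(?:(?! \*/).)+?@access public.+? \*/\s+?function\s+[a-zA-Z0-9_]+\(~s';

preg_match($original, $code, $before);
preg_match($fixed, $code, $after);

// The lookbehind is tested only once, right before "@access", so ".+?" is
// free to run across the end of the DS comment.
var_dump(strpos($before[0], 'DIRECTORY_SEPARATOR') !== false);

// The lookahead is tested before every consumed character, so the match
// cannot cross " */" and has to start at foo()'s own doc comment.
var_dump(strpos($after[0], 'DIRECTORY_SEPARATOR') === false);
```

The lookbehind version checks a single position, while `(?:(?! \*/).)+?` checks every position it steps over, which is what keeps the match inside one comment.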

EDIT:

Full pattern that also matches function types and parameters:

(?<full>[\t ]*?/\*\*(?:(?! \*/).)+?@access public(?:(?! \*/).)+? \*/\s+?(?:public |protected |private )??(?:static )??function\s+[a-zA-Z0-9_]+\(.*?\))
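As a rough usage sketch, the function pattern can be passed to preg_match_all to collect every matching declaration. The `Demo` class, the `~` delimiters and the `s` modifier below are assumptions made for the example, not part of the original pattern.

```php
<?php
// Hypothetical input: one method tagged "@access public", one tagged "@access private".
$code = <<<'SRC'
<?php
class Demo
{
    /**
     * Adds two numbers.
     *
     * @access public
     */
    public static function add($a, $b)
    {
        return $a + $b;
    }

    /**
     * Internal helper.
     *
     * @access private
     */
    private function helper()
    {
    }
}
SRC;

// The pattern from above, wrapped in "~" delimiters with the "s" modifier (assumed).
$pattern = '~(?<full>[\t ]*?/\*\*(?:(?! \*/).)+?@access public(?:(?! \*/).)+? \*/\s+?'
         . '(?:public |protected |private )??(?:static )??function\s+[a-zA-Z0-9_]+\(.*?\))~s';

preg_match_all($pattern, $code, $matches);

// Each entry of $matches['full'] holds a doc comment plus the declaration up to ")".
foreach ($matches['full'] as $declaration) {
    echo $declaration, "\n";
}
```

Only the add() method should be captured here, since the second doc comment never contains `@access public` before its closing `*/`.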

And class matching:

(?<full>(?<indent>[\t ]*?)/\*\*(?:(?! \*/).)+?@access public.+? \*/\s+?(?:abstract )??class\s+[a-zA-Z0-9_]+\s??.*?{)

Sarke

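In PCRE, a negative lookbehind must be of fixed length. It also only tests the characters immediately before the current position, so it cannot constrain what .+? has already consumed. It sounds like you would be better served using some sort of DocBlock parser. There are numerous solutions available.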

Mike Brant

With the token_get_all() function:

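$tokens = token_get_all($code);
$result = array();
$isPublic = false;   // the most recent doc comment contains "@access public"
$isFunction = false; // the previous non-whitespace token was "function"

foreach ($tokens as $token) {
    // $token is an array (id, text, line) or a one-character string such as "(";
    // for a string, $token[0] is that character, which matches no case and
    // ends up in default.
    switch ($token[0]):
        case T_DOC_COMMENT:
            // strpos() returns an offset or false, so compare strictly
            $isPublic = strpos($token[1], '@access public') !== false;
            break;

        case T_FUNCTION:
            $isFunction = true;
            break;

        case T_WHITESPACE:
            break;

        case T_STRING:
            if ($isFunction && $isPublic) $result[] = $token[1];
            // a function name consumes the doc comment, so the tag does not
            // leak onto later, undocumented functions
            if ($isFunction) $isPublic = false;
            // no break: fall through to reset $isFunction

        default:
            $isFunction = false;
    endswitch;
}

print_r($result);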

To get an idea of what you can extract with the tokenizer, I suggest putting the following code inside the foreach loop, right after the endswitch;:

if ($isPublic && isset($token[1]))
    printf("%s\t%s\t%s\n", $token[0],
                           token_name($token[0]),
                           strtr($token[1], "\n", ' ')
                           ); 
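
The approach suggested in the comments (use the tokenizer to find the doc comments, then apply a small regex to the comment text only) could look roughly like the sketch below. The function name `accessTagsByFunction` is hypothetical, and discarding a pending doc comment at `;` or `}` is a heuristic assumed for this example, not something the tokenizer provides.

```php
<?php
// Hypothetical helper: maps each named function to the @access value found
// in the doc comment that directly precedes it.
function accessTagsByFunction(string $code): array
{
    $result = [];
    $docComment = null;   // most recent doc comment not yet attached to a function
    $sawFunction = false; // true right after a T_FUNCTION token

    foreach (token_get_all($code) as $token) {
        if (!is_array($token)) {
            // One-character token such as "(" or ";".
            if ($token === ';' || $token === '}') {
                $docComment = null; // heuristic: the comment belonged to something else
            }
            $sawFunction = false;
            continue;
        }

        [$id, $text] = $token;

        if ($id === T_DOC_COMMENT) {
            $docComment = $text;
        } elseif ($id === T_FUNCTION) {
            $sawFunction = true;
        } elseif ($id === T_STRING && $sawFunction) {
            // $text is the function name; the regex only sees the comment text.
            if ($docComment !== null && preg_match('/@access\s+(\w+)/', $docComment, $m)) {
                $result[$text] = $m[1];
            }
            $docComment = null;
            $sawFunction = false;
        } elseif ($id !== T_WHITESPACE) {
            $sawFunction = false;
        }
    }

    return $result;
}

// Usage with the sample file from the question.
$code = <<<'SRC'
<?php

/**
 * This is the shortcut to DIRECTORY_SEPARATOR
 */
defined('DS') or define('DS',DIRECTORY_SEPARATOR);

/**
 * Foo
 *
 * @return bool
 * @access public
 */
function foo()
{
    return true;
}
SRC;

print_r(accessTagsByFunction($code));
```

Because the regex never sees anything outside a single T_DOC_COMMENT token, it does not need a lookaround to stay inside one comment.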
Casimir et Hippolyte