
I have a folder with some classes and another with some functions.

Usually one class or function per file, but that is not always the case.

On a few occasions a class might be accompanied by a function or two, and some functions might be grouped together.

I am reading each file and building a nice manual from the detailed comments each of them has.

I was thinking it would be nice to grab the code of the class or function as well.

But I have not found a way to do so.

Regular expressions are out of the question since they could only match simple functions.

I have found the PHP Tokenizer, but I can't figure out how it could help.

Google has been no help either.

I am looking for a pure PHP solution if one exists.

Let's say I have code like this:

class BaseClass {
   function __construct() {
       print "In BaseClass constructor\n";
   }
}

class SubClass extends BaseClass {
   function __construct() {
       parent::__construct();
       print "In SubClass constructor\n";
   }
}

class OtherSubClass extends BaseClass {
    // inherits BaseClass's constructor
}

function monkey() {
    return new BaseClass();
}

function weasel() {
    return new SubClass();
}

function dragon() {
    return new OtherSubClass();
}

I want to parse it and get an array of 6 entries, one with each class and one with each function.

transilvlad
  • And what have you tried so far and where are you having problems at (post some code) ? – Prix Jun 15 '13 at 11:06
  • No code so far. Regex is useless; it does not work at all. Tokenizer does not even come close to what I need. I have been looking into this for two days now but haven't got far, sorry. – transilvlad Jun 15 '13 at 11:21
  • Regex is useless; you can't use it to recognize nested structures. The tokenizer isn't useless; you just need a lot of machinery on top of it to build a real parser that you probably aren't willing to write. – Ira Baxter Jun 15 '13 at 15:25
  • EDIT: your example. And what happens if the file contains HTML or a string that contains stuff that looks like a function call? If you don't mind it being unreliable, you can hunt for the keywords 'class', 'function', and '{' '}' and simply count nesting. (Here's a place the tokenizer is directly useful). If you want it to be reliable, you probably need a parser. – Ira Baxter Jun 15 '13 at 16:42
  • I'm building this for my own code, not general use. I write what I'd say is tidy code. – transilvlad Jun 15 '13 at 16:48

2 Answers

1

What you need is basically a parser, so that you can pick out structures of interest. Then you either use the position information such a parser gathers (if it is well designed) to determine the boundaries of the text to extract from your file, or you "prettyprint" the AST of the parsed structure to get your artifact.
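To make the "position information" idea concrete, here is a minimal sketch built on PHP's own `token_get_all()`. It is not a real parser, and the function name `extract_structures()` is made up for illustration. It assumes PHP 7 or later and reasonably tidy code: it rebuilds the text of each named class, interface, trait, or function by concatenating tokens from the keyword to the matching closing brace.

```php
<?php
// Sketch only: extract_structures() is a hypothetical helper, not an existing API.
function extract_structures(string $code): array
{
    $tokens     = token_get_all($code);
    $count      = count($tokens);
    $starters   = [T_CLASS, T_INTERFACE, T_TRAIT, T_FUNCTION];
    $structures = [];
    $buffer     = null; // source collected so far; null while outside a structure
    $name       = '';
    $prefix     = '';   // "abstract " or "final " seen just before the keyword
    $depth      = 0;

    for ($i = 0; $i < $count; $i++) {
        $token = $tokens[$i];
        $id    = is_array($token) ? $token[0] : null;
        $text  = is_array($token) ? $token[1] : $token;

        if ($buffer === null) {
            // A structure starts at a keyword that is directly followed by a name;
            // closures, "new class" and "Foo::class" have no name there and are skipped.
            $j = $i + 1;
            while ($j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_WHITESPACE) {
                $j++;
            }
            $named = $j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_STRING;

            if (in_array($id, $starters, true) && $named) {
                $name   = $tokens[$j][1];
                $buffer = $prefix . $text;
                $prefix = '';
            } elseif ($id === T_ABSTRACT || $id === T_FINAL
                      || ($id === T_WHITESPACE && $prefix !== '')) {
                $prefix .= $text;
            } else {
                $prefix = '';
            }
            continue;
        }

        $buffer .= $text;

        // Only real code braces are counted: braces inside string literals,
        // comments or inline HTML live inside array tokens and never match here.
        if ($token === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
        } elseif ($token === '}') {
            $depth--;
            if ($depth === 0) {
                $structures[$name] = $buffer; // keyed by class or function name
                $buffer = null;
            }
        }
    }

    return $structures;
}
```

Calling `extract_structures(file_get_contents($path))` returns an array keyed by name, one entry per class or stand-alone function. Structures that share a name overwrite each other, and named functions declared inside an anonymous class would be reported as stand-alone, so a full parser remains the safer choice for arbitrary code.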

NikiC describes his search and eventual construction of one such parser in PHP in this SO question. There are other solutions provided there, including mine, but mine isn't in PHP.

You may have some trouble picking out the exact function you want. Imagine you have a file with two classes C1 and C2, each containing a method named M. Now to select the "right method", you need to have the full path C1::M available, and you need to check that the method M is found in the right class C1. You can do this by walking up the parse tree from M. If you have traits, this might get harder, as a method M might be defined in a trait, and then integrated into a class definition. To do this really right, you need name resolution for PHP.
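As a rough illustration of the "full path" idea, the sketch below walks the token stream and remembers which class it is currently inside, so that two methods named M come out as `C1::M` and `C2::M`. The helper name `list_qualified_functions()` is invented for this example, PHP 7 or later is assumed, and it does no name resolution at all: methods pulled in from traits are listed under the trait, not under the class that uses it.

```php
<?php
// Sketch only: list_qualified_functions() is a hypothetical helper name.
function list_qualified_functions(string $code): array
{
    $tokens     = token_get_all($code);
    $names      = [];
    $class      = null; // name of the enclosing class, interface or trait
    $classDepth = 0;    // brace depth at which that class was declared
    $depth      = 0;

    foreach ($tokens as $i => $token) {
        $id = is_array($token) ? $token[0] : null;

        if ($token === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
            continue;
        }
        if ($token === '}') {
            $depth--;
            // Back at the depth where the class keyword appeared: the class is closed.
            if ($class !== null && $depth === $classDepth) {
                $class = null;
            }
            continue;
        }
        if (!in_array($id, [T_CLASS, T_INTERFACE, T_TRAIT, T_FUNCTION], true)) {
            continue;
        }

        // The declared name is the next non-whitespace token, if it is a T_STRING.
        $j = $i + 1;
        while (isset($tokens[$j]) && is_array($tokens[$j]) && $tokens[$j][0] === T_WHITESPACE) {
            $j++;
        }
        if (!isset($tokens[$j]) || !is_array($tokens[$j]) || $tokens[$j][0] !== T_STRING) {
            continue; // closure, anonymous class, or Foo::class
        }
        $name = $tokens[$j][1];

        if ($id === T_FUNCTION) {
            $names[] = $class === null ? $name : $class . '::' . $name;
        } elseif ($class === null) {
            $class      = $name;
            $classDepth = $depth;
        }
    }

    return $names;
}
```

Checking a requested path such as `C1::M` against this list is then a plain `in_array()` lookup.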

If you go that far, you might need to abuse HipHop (the PHP-to-C++ compiler) into extracting what you want, under the assumption that it likely builds ASTs and full symbol tables in a usable form. (I don't know if it in fact does that.)

Ira Baxter
  • I do not want to get the code of a function of a class. All I want is to get all classes and independent functions from the code. Nothing fancy. My goal is to attach the source code of a class or a function to the manual page generated from the comments of the classes and functions. Technically speaking, I already know what classes and functions I want to get. – transilvlad Jun 15 '13 at 15:39
  • That's not what you said: " it would be nice to grab the code of the class or function". I assume by that you wanted the *text* of such entities. To get the text, you have to find the boundaries of each entity. Because PHP class and function definitions are complex, including embedded HTML, a parser would serve you well and these things exist. You can try to hack it yourself from the token stream, but I suspect you'll discover you are re-inventing a parser one inch at a time, and surely you have better things to do. – Ira Baxter Jun 15 '13 at 16:23
  • Most parsers I looked at do a lot of nice things, but I have had problems finding a functionality that does what I need. – transilvlad Jun 15 '13 at 16:36
  • Pretty much everything off the shelf doesn't do what anybody needs. Did you check NikiC's parser? The key trick for you is the parser needs to collect position information, so you know where the start and end of entities (classes,functions) are. – Ira Baxter Jun 15 '13 at 16:39
  • So I built a parser that works surprisingly better than I ever would have expected. Just one thing has given me issues: the presence of curly brackets in strings. – transilvlad Jun 16 '13 at 03:18
  • "issues ... curly brackets in strings". (Wait till you find out that somebody has coded a curly bracket as an octal escape). As I said, if you don't care if it works perfectly, you can hack it whatever way you like. If you do care, by the time you get it to work perfectly on arbitrary code, you'll end up building a full parser. Since I don't know what limitations you'll accept, your answer will vary according to your tastes. – Ira Baxter Jun 24 '13 at 18:47
0
<?php

/**moDO(Classes)(Parsers)(parse_php)

  @(Description)
    Parses php code and generates an array with the code of the classes and stand-alone functions found.

  @(Description){note warn}
    Curly brackets outside the code structures can break the parser!


  @(Syntax){quote}
    object `parse_php` ( string ~$path~ )


  @(Parameters)
  @(Parameters)($path)
    Path to the php file to be parsed.


  @(Return)
    Returns an object that contains a variable `parsed` that contains the resulting array.


  @(Examples)
  @(Examples)(1){info}
    `Example 1:` Basic usage example.

  @(Examples)(2){code php}
  $parser = new parse_php(__FILE__);
  print_r($parser->parsed);


  @(Changelog){list}
   (1.0) ~Initial release.~

*/

  /**
  * Parses php code and generates an array with the code of the classes and stand-alone functions found.
  * Note: Curly brackets outside the code structures can break the parser!
  * @syntax new parse_php($path);
  * @path string containing path to file to be parsed
  */
  class parse_php {
    public $parsed = false;

    /**
    * Validates the path parameter and starts the parsing.
    * Once parsing done it sets the result in the $parsed variable.
    * @path string representing valid absolute or relative path to a file.
    */
    function __construct($path) {
      if(is_file($path)) {
        $this->parsed = $this->load($path);
      }
    }

    /**
    * Loads the file and prepares its contents for parsing.
    * It normalizes the line endings, builds lines array and looks up the structures. 
    * @path string representing valid absolute or relative path to a file.
    */
    private function load($path) {
      $file   = file_get_contents($path);
      $string = str_replace(Array("\r\n", "\r", "\n"), Array("\n", "\n", "\r\n"), $file);
      $array  = explode("\r\n", $string);

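      // Match declarations up to the opening brace. The identifier tail uses "*"
      // so that single-character names such as "class A" are matched as well.
      preg_match_all('/((abstract[ ])?(function|class|interface)[ ]+'
                    .'[a-z_\x7f-\xff][a-z0-9_\x7f-\xff]*[ ]*(\((.+)?\)[ ]*)?)'
                    .'([ ]*(extends|implements)[ ]*[a-z_\x7f-\xff]'
                    .'[a-z0-9_\x7f-\xff]*[ ]?)?[ ]*(\r|\n|\r\n)*[ ]*(\{)/i'
                    , $string
                    , $matches);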

      $filtered = Array();
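      foreach($matches[0] AS $match) {
        // Keep only the first line of the match; the brace may sit on the next line.
        // Indexing the result avoids an undefined-offset notice for one-line matches.
        $parts      = explode("\r\n", $match, 2);
        $filtered[] = $parts[0];
      }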

      return $this->parse($array, $filtered);
    }

    /**
    * The parser that loops the content lines and builds the result array.
    * It accounts for nesting and skips all functions that belong to a class.
    * @lines array with the lines of the code file.
    * @matches array containing the classes and possible stand-alone functions to be looked up.
    */
    private function parse($lines, $matches) {
      $key        = false;
      $track      = false;
      $nesting    = 0;
      $structures = Array();

      foreach($lines AS $line) {
        if($key === false)
         $key = $this->array_value_in_string($line, $matches);

        if($key !== false) {
          if($nesting > 0)
           $track = true;

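          // Only braces preceded by a space are counted, so a brace in the very
          // first column of a line is missed; the parsed code must be indented.
          $nesting = $nesting + substr_count($line, ' {');
          $nesting = $nesting - substr_count($line, ' }');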

          $structures[$key][] = $line;

          if($track && $nesting == 0) {
            $track = false;
            $key   = false;
          }
        }
      }

      return array_values($structures);
    }

    /**
    * Checks if any of the (array)needles are found in the (string)haystack.
    * @syntax $this->array_value_in_string($string, $array);
    * @haystack string containing the haystack subject of the search.
    * @needles array containing the needles to be searched for.
    */
    private function array_value_in_string($haystack, $needles) {
      foreach($needles AS $key => $value) {
        if(stripos($haystack, $value) !== false)
         return $key;
      }
      return false;
    }
  }

  /**
  * Example: parse test.php and print the result.
  */
  header('Content-type: text/plain');
  $parser = new parse_php('test.php');
  print_r($parser->parsed);
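
The one case that trips up the line-based counting above, curly brackets inside strings, is where the tokenizer helps directly. The following is only a sketch, assuming PHP 7 or later, and `count_code_braces()` is a made-up helper name: it counts braces that are actual code tokens, whereas a brace inside a string literal is part of a string token and is ignored.

```php
<?php
// Sketch only: count_code_braces() is a hypothetical helper name.
function count_code_braces(string $code): array
{
    $open  = 0;
    $close = 0;

    foreach (token_get_all($code) as $token) {
        $id = is_array($token) ? $token[0] : null;

        // "{$var}" and "${var}" inside strings open with their own token types
        // but close with a plain "}", so they are counted as opening braces too.
        if ($token === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            $open++;
        } elseif ($token === '}') {
            $close++;
        }
    }

    return ['open' => $open, 'close' => $close];
}

// A closing brace inside a string literal: raw text counting sees two "}",
// token-based counting sees only the one that closes the function body.
$sample   = '<?php function f() { return "}"; }';
$rawClose = substr_count($sample, '}');
$balanced = count_code_braces($sample);
```

The input has to include the opening `<?php` tag, otherwise the tokenizer treats the whole string as inline HTML.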
transilvlad