Introduction

I'm working on a project that scans websites for vulnerabilities. For that I need to write a spider that indexes all of a site's pages.

I'm using a combination of two libraries to build the spider:

1) Symfony\Component\BrowserKit\Client // an abstract class
2) mmerian\phpcrawl\PHPCrawler // a concrete class with a method that is meant to be overridden

To use them I need to extend both: the first is abstract, and the second only becomes useful once I override its handleDocumentInfo method.

PHP doesn't allow multiple inheritance, is there a way around this issue?

Spider.php

<?php

namespace App\Core;

use PHPCrawler; //I need to inherit from this class
use PHPCrawlerDocumentInfo;

use Symfony\Component\BrowserKit\Client as BaseClient;
use Symfony\Component\DomCrawler\Crawler; //type-hinted in parseHTMLDocument


class Spider extends BaseClient
{

    private $url;
    private $phpCrawler;

    public function __construct($url){
        parent::__construct();

        //I have instantiated the object instead of inheriting it.
        $this->phpCrawler = new PHPCrawler;

        $this->url = $url;
    }

    public function setup(){

        $this->phpCrawler->setURL($this->url);

        $this->phpCrawler->addContentTypeReceiveRule("#text/html#"); 

        $this->phpCrawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|css)$# i"); 
    }

    public function start(){

        $this->setup();

        echo 'Starting spider' . PHP_EOL;
        $this->phpCrawler->go();

        $report = $this->phpCrawler->getProcessReport();

        echo "Summary:". PHP_EOL; 
        echo "Links followed: ".$report->links_followed . PHP_EOL; 
        echo "Documents received: ".$report->files_received . PHP_EOL; 
        echo "Bytes received: ".$report->bytes_received." bytes". PHP_EOL; 
        echo "Process runtime: ".$report->process_runtime." sec" . PHP_EOL;

        if(!empty($this->phpCrawler->links_found)){
            echo 'not empty';
        }
    }

    //Intended override - never called, because Spider does not inherit from PHPCrawler
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $pageInfo){

        $this->parseHTMLDocument($pageInfo->url, $pageInfo->content);

    }

    public function parseHTMLDocument($url, $content){

        $crawler = $this->request('GET', $url);

        $crawler->filter('a')->each(function (Crawler $node, $i){
            echo $node->attr('href');
        });

    }

    //Abstract method of BrowserKit\Client that must be implemented
    public function doRequest($request){}

}
melkawakibi
  • `PHP doesn't allow multiple inheritance, is there a way around this issue?` Traits.... But in this case I would build a wrapper. – ArtisticPhoenix Jul 11 '17 at 15:30
  • How do you suggest I do this, I already tried using Trait but, Traits don't allow extends. – melkawakibi Jul 11 '17 at 15:31
  • Have you tried this ugly but somewhat effective solution https://stackoverflow.com/questions/356128/can-i-extend-a-class-using-more-than-1-class-in-php ? – Dave Goten Jul 11 '17 at 15:33
  • I would extend the abstract class, extend the other class, change the functionality as you need, then wrap both of them in a third class that glues them together. – ArtisticPhoenix Jul 11 '17 at 15:38
  • Doesn't the built-in [`DOMCrawler`](https://symfony.com/doc/current/components/dom_crawler.html) work for you? It's what `$crawler = $this->request('GET', $url);` returns. – apokryfos Jul 11 '17 at 16:07
  • @apokryfos I use PHPCrawler as a Spider for general tasks like indexing webpages and the DomCrawler for more advanced tasks like submitting forms. – melkawakibi Jul 11 '17 at 17:38
  • @ArtisticPhoenix I will try your suggestion, thanks everyone. – melkawakibi Jul 11 '17 at 17:40
  • @ArtisticPhoenix I have somewhat implemented your solution and it works. I have a fully functional spider /w DomCrawler and PHPCrawl. – melkawakibi Jul 12 '17 at 09:59
  • Cool glad to hear you worked it out, from a code maintenance standpoint it's better to wrap it too, say you want to change one of those libraries you only have to change what the wrapper does inside not the code outside of it. It provides an Abstraction layer. – ArtisticPhoenix Jul 12 '17 at 15:45
  • PHPQuery is another good class for HTML parsing; it lets you use jQuery-like selectors to traverse the DOM. – ArtisticPhoenix Jul 12 '17 at 15:47
  • @ArtisticPhoenix thank you, I will check it out. – melkawakibi Jul 12 '17 at 17:25

1 Answer

I have found a solution to my problem.

I created a concrete class, BaseClient, that extends the abstract class BrowserKit\Client. The Spider class can then instantiate BaseClient instead of extending it. That leaves Spider free to extend PHPCrawler, so its handleDocumentInfo override is actually called.

Class structure of the solution

Core/
 - BaseClient //extends BrowserKit\Client
 - Spider //extends PHPCrawler
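
The structure above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual project code: `Client` and `PHPCrawler` here are simplified stand-ins for the library classes so the example runs without Composer, the `$visited` property is hypothetical, and `handleDocumentInfo` receives a plain URL string instead of a `PHPCrawlerDocumentInfo` object.

```php
<?php
// Stand-ins for the two library classes so the sketch runs on its own.
// In the real project these would be Symfony\Component\BrowserKit\Client
// and PHPCrawler; the method bodies below are placeholders, not library code.

abstract class Client                      // stand-in for BrowserKit\Client
{
    public function request(string $method, string $uri)
    {
        return $this->doRequest("$method $uri");
    }

    abstract protected function doRequest($request);
}

class PHPCrawler                           // stand-in for the PHPCrawl class
{
    protected $urls = [];

    public function setURL(string $url): void
    {
        $this->urls[] = $url;
    }

    public function go(): void
    {
        // The real crawler fetches pages; here each URL is just passed to the hook.
        foreach ($this->urls as $url) {
            $this->handleDocumentInfo($url);
        }
    }

    // Hook that subclasses are expected to override.
    public function handleDocumentInfo($pageInfo) {}
}

// Concrete subclass of the abstract client, so it can be instantiated.
class BaseClient extends Client
{
    protected function doRequest($request)
    {
        return "response for [$request]";  // placeholder response
    }
}

// Spider inherits from the crawler and holds a client (composition).
class Spider extends PHPCrawler
{
    private $client;
    public $visited = [];                  // hypothetical, for illustration only

    public function __construct(string $url, Client $client)
    {
        $this->client = $client;
        $this->setURL($url);
    }

    // Called by go(), because Spider now really is a PHPCrawler.
    public function handleDocumentInfo($pageInfo)
    {
        $this->visited[] = $this->client->request('GET', $pageInfo);
    }
}

$spider = new Spider('http://example.com', new BaseClient());
$spider->go();
print_r($spider->visited);
```

With the real libraries, Spider would extend the actual PHPCrawler and call parent::__construct(), and BaseClient::doRequest would need to return a BrowserKit Response object. The sketch only shows the shape of the design: inheritance for the crawler, composition for the client.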
melkawakibi