
I'm pretty new to OOP, so please have mercy ;( . I am not even sure if the title of this post is OK.

I'm crawling some sites with Goutte, like this:

$ad['title'] = $crawler->filter('#subject')->text();
$ad['image'] = $crawler->filter('.images')->filter('meta')->eq(0)->attr('content');

This is not too difficult, but I want to have reusable code. For every site I scrape there is an $ad['title'] and an $ad['image'], but the $crawler methods used differ per site, so I would like to have something like

$crawler->$filter

where $filter contains

"filter('#subject')->text()"

That way I can store the filters in the database per site. I don't know if this is possible or even a good approach.
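
Something like this is what I have in mind (a rough sketch; the stored $filter values are made up):

// Hypothetical stored "filter": a CSS selector plus the extraction method to call.
$filter = ['selector' => '#subject', 'method' => 'text'];
$ad['title'] = $crawler->filter($filter['selector'])->{$filter['method']}();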

Harmstra

1 Answer


One way to deal with your problem is to use the OOP concept of polymorphism. For PHP, this is explained here, and in your case it can be used like this (greatly simplified):

Define an abstract class for your crawlers. Each concrete crawler extends it and provides its own implementation. Of course, the base class embeds the $crawler object.

abstract class BaseCrawler
{
    protected $crawler;

    // initialize the crawler etc.
    public function __construct(\Symfony\Component\DomCrawler\Crawler $crawler)
    {
        $this->crawler = $crawler;
    }

    abstract public function getTitleElement();
    abstract public function getImageElement();
}

class CrawlerOne extends BaseCrawler
{
    public function getTitleElement()
    {
        // get the title for crawler one, using the selectors from the question
        return $this->crawler->filter('#subject')->text();
    }

    public function getImageElement()
    {
        // get the image for crawler one
        return $this->crawler->filter('.images')->filter('meta')->eq(0)->attr('content');
    }

    // other functionality may come here
}

class CrawlerTwo extends BaseCrawler
{
    public function getTitleElement()
    {
        // get the title for crawler two (selector is hypothetical)
        return $this->crawler->filter('h1.ad-title')->text();
    }

    public function getImageElement()
    {
        // get the image for crawler two (selector is hypothetical)
        return $this->crawler->filter('img.main-photo')->attr('src');
    }

    // other functionality may come here
}

This way your structure stays flexible but shares common functionality.
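
For illustration, the classes above could be used like this (a minimal sketch; the URL and the choice of CrawlerOne are placeholders):

use Goutte\Client;

$client = new Client();
$domCrawler = $client->request('GET', 'http://example.com/some-ad');

// Pick the subclass that matches the site being scraped.
$siteCrawler = new CrawlerOne($domCrawler);

$ad['title'] = $siteCrawler->getTitleElement();
$ad['image'] = $siteCrawler->getImageElement();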

A database is for storing data, not logic. That said, if the title and image can be fetched with a simple regular expression, the expression itself can be stored in the database for each crawler. In that case, each crawler can define a constant code used to look up its title and image regular expressions.
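
As a sketch of that idea (the table and column names here are hypothetical):

// One generic crawler that looks up its patterns by a per-site code.
class RegexCrawler
{
    private $titleRegex;
    private $imageRegex;

    public function __construct(\PDO $db, $siteCode)
    {
        // Table/column names (site_patterns, site_code, ...) are assumptions.
        $stmt = $db->prepare(
            'SELECT title_regex, image_regex FROM site_patterns WHERE site_code = ?'
        );
        $stmt->execute([$siteCode]);
        $row = $stmt->fetch(\PDO::FETCH_ASSOC);
        $this->titleRegex = $row['title_regex'];
        $this->imageRegex = $row['image_regex'];
    }

    public function getTitle($html)
    {
        // Return the first capture group, or null if the pattern does not match.
        return preg_match($this->titleRegex, $html, $m) ? $m[1] : null;
    }

    public function getImage($html)
    {
        return preg_match($this->imageRegex, $html, $m) ? $m[1] : null;
    }
}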

Alexei - check Codidact
  • But this is exactly what I am trying to prevent. If I have 10, or even 100, different sites, I have to create loads of classes. I want to have one class with generic getTitle($filter) methods, with the filter sequence as an argument. The filter is stored in the database so that I can maintain them with a form somewhere on an admin page – Harmstra Feb 27 '16 at 10:22
  • If parsing of a site is based on only one discriminant (e.g. several regular expressions), you can define a single class that fetches the regular expressions from the database and performs the parsing. Otherwise, storing and dynamically executing code is asking for trouble (think about debugging, and how hard it is to make a change whenever the site structure changes, etc.). – Alexei - check Codidact Feb 27 '16 at 17:48
  • If the parsing algorithm is more complex, having a separate class for each site should not be a problem. If each does its own specific parsing, you are not repeating yourself (DRY) and maintenance is simple: a change in the parsing of one site does not affect the others (less risk), easier debugging, etc. – Alexei - check Codidact Feb 27 '16 at 17:52