
I'm trying to design a scraping application using BDD, DDD and OOP. The purpose of this app is to check whether a page is up and whether it still contains certain elements, like links, images, etc.

Using BDD to write my scenarios, I came up with classes like Page, Link and Image, with properties like url, src and alt.

The question I have is that I see two possibilities for checking against live websites:

1. Use another class, a crawler class, which would use the data contained in the previous classes and hit the web to check whether the pages are up, whether they contain the expected elements, etc.:

$crawler = new Crawler();
$page = new Page($url);

$pageReturned = $crawler->get($page);

if ($pageReturned->isUp()) {
  // continue with the checking of element...
  $image = new Image($src, $alt);

  if ($pageReturned->contains($image)) {
    // check other things
  } else {
    // image not found on the page
  }
}
2. Have this "crawling" behaviour included in the classes themselves (which looks more like OOP to me), which means that I would ask the page whether it is up, whether it contains a given element, etc.:

    $page = new Page($url);
    
    if ($page->isUp()) {
      $image = new Image($src, $alt);
    
      if ($page->contains($image)) {
        // check other things
      } else {
        // image not found on the page
      }
    }
    

I'd be tempted to use #2, but I'm wondering how I could do so without having the classes tied to a particular crawling library. I'd like to be able to switch later between different libraries, like Goutte or Guzzle, or even use cURL directly.

Maybe I'm missing the point of OOP altogether here... Maybe there are much better or cleverer ways of doing this, hence my question. :)

Iam Zesh
  • This is where the D in SOLID comes into play -- Dependency Inversion. A higher-level class should only depend on abstractions, not concrete classes. If the various crawling implementations implement a common interface and the calling code only works against this interface, the desired concrete implementation can be injected as needed (see the sketch after these comments). – dbugger May 31 '16 at 15:33
  • Why bother with a complex domain model? The only true object in the problem is the `Crawler` which needs to maintain state as it conducts its behaviour of "crawling". The rest of your "objects" are just data structures. Does `Page` have states between which it transitions? Does it have behaviour that changes during its lifetime? If no, it's just a data structure to be consumed by your `Crawler`. –  Jun 01 '16 at 20:39
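For illustration, here is a minimal sketch of the dependency-inversion idea from the first comment, in the same PHP style as the question. All of the names (PageFetcherInterface, CurlPageFetcher) are invented; concrete adapters could wrap Goutte, Guzzle or raw cURL behind the same contract, so Page never knows which library is in use.

interface PageFetcherInterface
{
    // Returns the raw HTML of the page, or null if it could not be fetched.
    public function fetch($url);
}

class CurlPageFetcher implements PageFetcherInterface
{
    public function fetch($url)
    {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($handle);
        $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        return ($html !== false && $status < 400) ? $html : null;
    }
}

class Page
{
    private $url;
    private $fetcher;

    public function __construct($url, PageFetcherInterface $fetcher)
    {
        $this->url = $url;
        $this->fetcher = $fetcher; // depends on the abstraction, not on cURL/Goutte/Guzzle
    }

    public function isUp()
    {
        return $this->fetcher->fetch($this->url) !== null;
    }
}

// Usage: the concrete fetcher can be swapped without touching Page.
$page = new Page('http://example.com', new CurlPageFetcher());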

2 Answers


One useful thing to realize is that your model code tends to be self-contained -- it knows about the data elements in the model (i.e., the data graph) and the data consistency rules, but not anything else.

So your model for a page would probably look like

class Page {
    URL uri;
    ImageCollection images;
}

In other words, the model knows about the relationship between pages and images, but it does not necessarily know what those things mean in practice.

To actually compare your domain model with the real world, you pass to the model some service that knows how to do the work, but does not know the state.

class Crawler {
    void verify(URL page, ImageCollection images)
}

Now you match them together; you construct the Crawler, and pass it to the Page. The page finds its state, and passes that state to the crawler

class Page {
    void verifyWith(Crawler crawler) {
        crawler.verify(this.uri, this.images);
    }
}

Of course, you probably don't want to couple the page too closely to the Crawler; after all, you might want to swap out the crawler libraries, or you might want to do something else with the page state.

So you make the signature of this method more general; it accepts an interface, rather than an object with a specific meaning. In the classic book Design Patterns, this would be an example of the Visitor Pattern

class Page {
    interface Visitor {
        void visitPage(URL uri, ImageCollection images);
    }

    void verifyWith(Visitor visitor) {
        visitor.visitPage(this.uri, this.images);
    }
}    

class Crawler implements Page.Visitor {
    void visitPage(URL page, ImageCollection images) {
        ....
    }
}

Note -- the model (page) is responsible for maintaining the integrity of its data. That means that any data it passes to a visitor should be immutable, or, failing that, a mutable copy of the state of the model.
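As a rough PHP-flavoured sketch of that point (the class and method names mirror the example above, not any real library), the page hands the visitor a snapshot rather than its live internals:

class Page
{
    private $url;    // string
    private $images; // array of immutable Image value objects

    public function verifyWith(PageVisitor $visitor)
    {
        // PHP arrays have value semantics when passed, so the visitor receives
        // a snapshot of the collection; the Image objects inside should themselves
        // be immutable so the visitor cannot alter the page's state.
        $visitor->visitPage($this->url, $this->images);
    }
}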

In the long term, you probably wouldn't want the definition of the Visitor embedded inside Page as in the example above. Page is part of the model's API, but the Visitor is part of the model's SPI.

interface PageVisitor {
    void visitPage(URL uri, ImageCollection images);
}

class Page {
    void verifyWith(PageVisitor visitor) {
        visitor.visitPage(this.uri, this.images);
    }
}    

class Crawler implements PageVisitor {
    void visitPage(URL page, ImageCollection images) {
        ....
    }
}

One thing that did get glossed over here is that you seem to have two different implementations of "page"

// Here's one?
$page = new Page($url);

// And here is something else?
$pageReturned = $crawler->get($page);

One of the lessons of domain-driven design is the naming of things; in particular, making sure that you don't combine two ideas that really have separate meanings. In this case, you should be clear about what type is returned by the crawler.

For example, if you were in a domain where the ubiquitous language borrowed from REST, then you might have statements that look like

$representation = $crawler->get($resource);

In your example, the language looks more HTML specific, so this might be reasonable

$htmlDocument = $crawler->get($page);

The reason for exposing this: the document/representation fits well with the notion of being a value object -- it's an immutable bag of immutable stuff; you can't change the "page" by manipulating the html document in any way.

Value objects are purely query surfaces -- any method on them that looks like a mutation is really a query that returns a new instance of the type.
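As a small, hypothetical PHP sketch of that idea, the "mutator" below never touches the current instance and instead returns a fresh value:

final class HtmlDocument
{
    private $html;

    public function __construct($html)
    {
        $this->html = $html;
    }

    public function html()
    {
        return $this->html;
    }

    // Reads like a mutation, but it is really a query that returns a new value;
    // the current instance is left untouched.
    public function withoutComments()
    {
        return new self(preg_replace('/<!--.*?-->/s', '', $this->html));
    }
}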

Value objects are a great fit for the specification pattern described by plalx in his answer:

interface HtmlSpecification {
    boolean isSatisfiedBy(HtmlDocument document);
}
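In PHP that could look roughly like the following, reusing the HtmlDocument sketch above; ContainsImage is an invented example, not part of any library:

interface HtmlSpecification
{
    public function isSatisfiedBy(HtmlDocument $document);
}

class ContainsImage implements HtmlSpecification
{
    private $src;

    public function __construct($src)
    {
        $this->src = $src;
    }

    public function isSatisfiedBy(HtmlDocument $document)
    {
        // Naive string check for illustration only; a real implementation
        // would query a parsed DOM instead.
        return strpos($document->html(), 'src="' . $this->src . '"') !== false;
    }
}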
VoiceOfUnreason
  • Thank you very much for your answer and explanation. The thing is that I have a hard time seeing where I would get the result of the page containing a certain image, for example. If I understood correctly, the crawler's `visitPage` method would be called when calling `Page->verifyWith($crawler)`, but as both of those methods don't return anything, where would I call a method to get a boolean with the result of the check / verification? – Iam Zesh Jun 29 '16 at 15:41
  • The most general answer: query the state of the crawler after the page has been verified. visitPage is a command that can modify the state of the crawler, for example adding validation errors to a list. You can then check if that list is empty (see the sketch below). – VoiceOfUnreason Jun 29 '16 at 15:57
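A minimal sketch of what that last comment describes, with invented PHP names: the crawler acts as the visitor and accumulates validation errors that can be queried once the page has been verified.

class Crawler implements PageVisitor
{
    private $errors = array();

    public function visitPage($url, array $images)
    {
        // Fetch the page and check each expected image; record a message
        // for anything that is missing or unreachable, e.g.:
        // $this->errors[] = 'image ' . $image->src() . ' not found on ' . $url;
    }

    public function errors()
    {
        return $this->errors;
    }
}

$crawler = new Crawler();
$page->verifyWith($crawler);

if (count($crawler->errors()) === 0) {
    // everything the page declared was found
}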

What about something like this? You could leverage any existing HTML parsing framework that can construct a document object model queryable through CSS selectors, and abstract the implementation behind domain interfaces.

I also used the Specification pattern to create matching criteria for pages, which makes it very easy to create new rules.

(Class diagram of the proposed design; the original image is not reproduced here.)

Usage:

var elementsQuery = new ElementsQuery('image[src="someImage.png"], a[href="http://www.google.com"]');
var spec = new PageAvailable().and(new ContainsElements(elementsQuery, 2));
var page = pageLoader.load(url);

if (spec.isSatisfiedBy(page)) {
    //Page is available & the page contains exactly one image with the attribute src=someImage.png and one link to google
}

One thing you could do to improve the design is to create a fluent builder that allows you to generate CSS selectors (ElementsQuery) more easily.

E.g.

var elementsQuery = new ElementsQueryBuilder()
                        .match('image').withAttr('src', 'someImage.png')
                        .match('a').withAttr('href', 'http://www.google.com');

Another important thing, if you eventually want to be able to create specifications that go beyond validating the existence of elements through an ElementsQuery, would be to expose a more powerful API for inspecting the Document Object Model (DOM).

You could have something like this to replace DOM in the above design, and adjust the PageSpecification API accordingly to give more power to specifications:

public interface Element {
    public String tag();
    public String attrValue(String attr);
    public boolean containsElements(ElementsQuery query, ExpectedCount count);
    public Elements queryElements(ElementsQuery query);
    public Elements children();
}

The advantage of having the entire DOM structure accessible from the domain, rather than just asking an infrastructure service whether the criteria are satisfied, is that both the declaration and the implementation of specifications can live in the domain.
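For example, a specification like ContainsElements could then be implemented entirely in the domain against that richer API. This is only a rough PHP-style sketch; it assumes Page exposes a queryElements method returning a countable collection, which is an invented detail:

class ContainsElements implements PageSpecification
{
    private $query;
    private $expectedCount;

    public function __construct(ElementsQuery $query, $expectedCount)
    {
        $this->query = $query;
        $this->expectedCount = $expectedCount;
    }

    public function isSatisfiedBy(Page $page)
    {
        // The rule and its implementation both live in the domain; only the
        // construction of the DOM is delegated to the infrastructure layer.
        return count($page->queryElements($this->query)) === $this->expectedCount;
    }
}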

In @VoiceOfUnreason's answer, the Crawler implementation must live in the infrastructure layer, and while the declaration of the rules lives in the domain (ImageCollection), the logic that checks those rules lives in the infrastructure.

Finally, I suppose that the page monitoring entries are probably persistent and might be configurable through a UI or a config file.

What I would perhaps do is have two different bounded contexts: one to maintain the pages to monitor with their associated specification (Page is an entity in this context), and another responsible for performing the monitoring (Page is a value in this context, using an implementation similar to what I described).
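As a rough sketch of that split (the namespaces and fields are invented for illustration), Page would simply take a different shape in each context:

namespace Monitoring\Configuration {
    // Entity: has an identity, is persisted, and carries the specification
    // definition to apply (e.g. the ElementsQuery and expected counts).
    class Page
    {
        private $id;
        private $url;
        private $specificationDefinition;
    }
}

namespace Monitoring\Execution {
    // Value object: only the data needed for one monitoring run, no identity.
    final class Page
    {
        private $url;
        private $dom;
    }
}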

plalx
  • @VoiceOfUnreason What do you think? – plalx Jun 01 '16 at 05:13
  • Not bad. The binding between the image and the elements query isn't quite right; a fluent builder might clean that up (but makes the example harder to understand). Also, specification needs to be able to query the state of the subject. – VoiceOfUnreason Jun 01 '16 at 12:35
  • And you should include some links to essays on the pattern :) – VoiceOfUnreason Jun 01 '16 at 12:36
  • Yes, a fluent builder could be better. What do you mean by "specification needs to be able to query the state of the subject"? I understand that right now the Page API (backed by the DOM API) is not rich enough to allow more complex specifications, but since these were the only 2 specifications needed for the current problem domain I thought it was better to apply YAGNI. – plalx Jun 01 '16 at 12:50
  • See my update: specifications on values are a much better idea than specifications on entities. – VoiceOfUnreason Jun 01 '16 at 14:27
  • @VoiceOfUnreason Well `Page` in the monitoring context is a value object right now. I haven't really worked out all the details in the answer but there could be a Management context where `Page` is an entity associated with a specification declaration, but processing it would be done in the monitoring context. – plalx Jun 01 '16 at 14:51