
Issue:
Cannot fully understand the Goutte web scraper.

Request:
Can someone please help me understand, or provide code demonstrating, how to use the Goutte web scraper? I have read over the README.md, but I am looking for more information than it provides, such as what options are available in Goutte and how to write those options, or, when you are working with forms, whether you search for the name= or the id= of the form.

Webpage Layout attempting to be scraped:
Step 1:
The webpage has a form with a radio button to choose what kind of form to fill out (i.e. Name or License). It defaults to Name, with First and Last Name textboxes along with a State drop-down select list. If you choose License, jQuery or JavaScript makes the First and Last Name textboxes go away and a License textbox appears.

Step 2:
Once the form is successfully submitted, it brings you to a page that has multiple links. We can follow either of two of them to get the information we need.

Step 3:
Once we have clicked the link we want, the third page has the data we are looking for, and we want to store that data in a PHP variable.

Submitting Incorrect information:
If wrong information is submitted, jQuery/JavaScript displays the message "No records were found." on the same page as the submission.

Note:
The preferred method would be to select the License radio button, fill in the license number, choose the state, and then submit the form. I have read tons of posts, blogs, and other items about Goutte, and nowhere can I find what options are available for Goutte, how to find that information out, or how to use it if it does exist.

scrfix
  • Perhaps this question needs to be more specific? At the moment it is very general, and so hard to answer. If the problem is that JavaScript is not running in Goutte, then that would be correct - you'd need to run a proper browser for that. Headless webkit would do that for you. – halfer Jun 18 '13 at 23:35

2 Answers


The documentation you want to look at is the Symfony2 DomCrawler.

Goutte is a client built on top of Guzzle that returns a Crawler every time you request or submit something:

use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', 'http://www.symfony-project.org/');

With this crawler you can do things like get all the <p> tags inside the body:

use Symfony\Component\DomCrawler\Crawler;

$nodeValues = $crawler->filter('body > p')->each(function (Crawler $node, $i) {
    return $node->text();
});
print_r($nodeValues);

Fill and submit forms:

$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array(
    'username' => 'username',
    'password' => 'xxxxxx'
));

A selectButton() method is available on the Crawler; it returns another Crawler that matches a button (input[type=submit], input[type=image], or a button element) with the given text. [1]

You can also click on links, set options, select check-boxes, and more; see Form and Link support.
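
For instance, here is a rough sketch of how the asker's preferred flow (License radio button, license number, state, submit, then follow a result link) might look. The button text and field names used here (Search, search_type, license_number, state, 'View details') are assumptions, not the real form's names; inspect the page's name= attributes first:

$form = $crawler->selectButton('Search')->form();
$form['search_type']->select('license');    // pick the "License" radio button by its value
$form['license_number']->setValue('12345'); // fill in the license number textbox
$form['state']->select('CA');               // choose an option from the State select list
$crawler = $client->submit($form);

// Follow one of the links on the results page by its link text.
$link = $crawler->selectLink('View details')->link();
$crawler = $client->click($link);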

To get data from the crawler, use the html() or text() methods:

echo $crawler->html();
echo $crawler->text();
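
If you only need the text of a specific element rather than the whole page, you can filter first and then call text(). The selector below is only a placeholder for whatever element holds the data on the third page:

// Placeholder selector; adjust it to match the element that holds the data you want.
$licenseData = $crawler->filter('#results .license-info')->text(); // store the scraped data in a PHP variable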
Onema

After much trial and error I have discovered a scraper that is much easier, better documented, better supported (if help is needed), and much more effective than Goutte. If you are having issues with Goutte, try the following:

  1. Simple HTML DOM: http://simplehtmldom.sourceforge.net/

If you are in the same situation I was, where the page you are trying to scrape requires a referrer from the site's own domain, you can use a combination of cURL and Simple HTML DOM, because Simple HTML DOM does not appear to be able to send a referrer. If you do not need a referrer, you can use Simple HTML DOM on its own to scrape the page.

$url="http://www.example.com/sub-page-needs-referer/";
$referer="http://www.example.com/";
$html=new simple_html_dom(); // Create a new object for SIMPLE HTML DOM
/** cURL Initialization  **/
$ch = curl_init($url);

/** Set the cURL options **/
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_REFERER,$referer);
$output = curl_exec($ch);

if($output === FALSE) {
  echo "cURL Error: ".curl_error($ch); // do something here if we couldn't scrape the page
}
else {
  $info = curl_getinfo($ch);
  echo "Took ".$info['total_time']." seconds for url: ".$info['url'];
  $html->load($output); // Transfer CURL to SIMPLE HTML DOM
}

/** Free up cURL **/
curl_close($ch);

// Do something with SIMPLE HTML DOM.  It is well documented and very easy to use.  They have a lot of examples.
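
For example, a small sketch of that last step (the selectors here are made up; Simple HTML DOM's find() accepts CSS-like selectors):

foreach ($html->find('table tr') as $row) {
    echo $row->plaintext . "\n"; // plain text of each matched row
}
$firstLink = $html->find('a', 0); // the second argument picks the nth match
echo $firstLink->href;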
scrfix
  • Goutte is trying to do a fair bit more than this, from what I can tell: clicking links, following redirects, submitting forms, and so forth - essentially emulating a browser. – halfer Jun 18 '13 at 23:31
  • Thanks. It is not the ability of Goutte that was being questioned, though; it's the lack of documentation for how to properly use it. I tried and tried to use it and just couldn't figure it out. Simple HTML DOM was a snap. After repeated failures with Goutte and help that never came, I didn't even need to ask for help with Simple HTML DOM; I only needed to read a small portion of the documentation to figure it out. – scrfix Jul 07 '13 at 10:52
  • I've only done a bit of Goutte, so it's difficult for me to say whether the docs are any good at this point. Are you using an autocompleting IDE, out of interest? If not, it will make your life much easier - I expect it would have been much harder for me if it wasn't for NetBeans. – halfer Jul 07 '13 at 11:09
  • Goutte is just a thin wrapper on top of other tools. If you want to scrape, look at the Symfony Scraper documentation; it is extensive and there are lots of examples. – Onema Dec 15 '13 at 23:45
  • Indeed, Goutte is just a wrapper for the [DomCrawler](http://symfony.com/doc/current/components/dom_crawler.html) component by Symfony, and the [CssSelector](https://symfony.com/doc/current/components/css_selector.html) component. Perhaps looking at the documentation will help you understand. I find it quite useful, as you can also run XPath queries on the DOM; it's so simple to fetch text or raw HTML. This allows me to combine XPath and CSS selections to perform very accurate crawls (see the sketch after this comment). – jsrosas Nov 23 '16 at 04:13
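
A minimal sketch of what that combination looks like with the DomCrawler component on its own (the markup and selectors are arbitrary examples, not taken from the page being scraped):

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<div class="result"><a href="/detail">Detail</a></div>');
$byCss   = $crawler->filter('div.result > a');                  // CSS selection (needs the CssSelector component)
$byXpath = $crawler->filterXPath('//div[@class="result"]/a');   // equivalent XPath selection
echo $byXpath->first()->text();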