
I'm trying to scrape a webpage using Laravel, Goutte, and Guzzle. I'm trying to pass an instance of Guzzle into Goutte, but my web server keeps insisting on Symfony\Contracts\HttpClient\HttpClientInterface. Here's the exact error I'm getting:

Argument 1 passed to Symfony\Component\BrowserKit\HttpBrowser::__construct() must be an instance of Symfony\Contracts\HttpClient\HttpClientInterface or null, instance of GuzzleHttp\Client given, called in /opt/bitnami/apache/htdocs/app/Http/Controllers/ScrapeController.php on line 52

Line 52 refers to this line: $goutteClient = new Client($guzzleClient);

Here's my class. How can I force it to use Goutte instead of Symfony?

Changing the line to this: $goutteClient = new \Goutte\Client($guzzleClient); does not fix it.

<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use Goutte\Client;
use GuzzleHttp\Cookie;
use GuzzleHttp\Client as GuzzleClient;

class ScrapeController extends Controller
{
    public function index()
    {
        return view('index');
    }
    public function scrape() {
        $url = 'www.domain.com';
        $domain = 'www.domain.com';

        $cookieJar = new \GuzzleHttp\Cookie\CookieJar(true);

        // get the cookie from www.domain.com
        $cookieJar->setCookie(new \GuzzleHttp\Cookie\SetCookie([
            'Domain'  => 'www.domain.com',
            'Name'    => '_name_session',
            'Value'   => 'value',
            'Discard' => true
        ]));
        $guzzleClient = new \GuzzleHttp\Client([
            'timeout' => 900,
            'verify' => false,
            'cookies' => $cookieJar
        ]);
        $goutteClient = new Client($guzzleClient);

        $crawler = $goutteClient->request('GET', $url);
        $crawler->filter('table')->filter('tr')->each(function ($node) {
            dump($node->text());
        });
    }
}
kryz

2 Answers


You cannot pass it a Guzzle client; it does not support accepting one.

The error is clear in telling you that Goutte\Client must take an instance of Symfony\Contracts\HttpClient\HttpClientInterface or null; you cannot give it a GuzzleHttp\Client.
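For example, a minimal sketch, assuming Goutte 4 (where Goutte\Client is a thin wrapper around Symfony's HttpBrowser):

use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

// Goutte's constructor accepts a Symfony HttpClientInterface (or null), not a Guzzle client
$goutteClient = new Client(HttpClient::create([
    'timeout'     => 900,
    'verify_peer' => false, // Symfony's counterpart to Guzzle's 'verify' => false
]));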

Handling cookies in the Symfony client would need to follow this: https://symfony.com/doc/current/http_client.html#cookies.
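Roughly, that means sending the cookie yourself as a plain header, since the low-level Symfony client has no cookie jar. A sketch, where $client is assumed to come from HttpClient::create() and the cookie name/value are the placeholders from the question:

// to the low-level Symfony client, cookies are just ordinary request headers
$response = $client->request('GET', 'https://www.domain.com/', [
    'headers' => [
        'Cookie' => '_name_session=value',
    ],
]);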

James

Here's a fun little observation: Goutte\Client is now simply a thin extension of Symfony\Component\BrowserKit\HttpBrowser, so based on that you can modify your scrape function to be something like:

use Symfony\Component\BrowserKit\Cookie;
use Symfony\Component\BrowserKit\CookieJar;
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

...

public function scrape() {
  $url = 'http://www.example.com/';
  $domain = 'www.example.com';

  // seed the BrowserKit cookie jar with the session cookie
  $jar = new CookieJar();
  $jar->set(new Cookie('_name_session', 'value', null, null, $domain));

  // same options as the Guzzle setup: long timeout, no TLS verification
  $client = HttpClient::create([
    'timeout' => 900,
    'verify_peer' => false
  ]);
  $browser = new HttpBrowser($client, null, $jar);

  $crawler = $browser->request('GET', $url);
  $crawler->filter('div')->filter('h1')->each(function ($node) {
    dump($node->text());
  });
}

In your composer.json you'll need to have requires similar to the following:

"symfony/browser-kit": "^5.3",
"symfony/css-selector": "^5.3",
"symfony/http-client": "^5.3"

but fabpot/goutte requires all of them anyway, so no libraries will be downloaded in addition to what you already have.

msbit
  • Wow that's a much better solution! Just showing me a blank screen though. Any ideas? – kryz Sep 01 '21 at 04:11
  • Ensure that the crawler filters are as you expect (eg `table` and `tr` as opposed to `div` and `h1` from my code), and that `dump` also does what you expect. I think you'd need to return the body from `scrape` if you are using it as a routing method (akin to `index`)? – msbit Sep 01 '21 at 05:02
  • This is proving difficult. Definitely had to remove `dump` (https://stackoverflow.com/a/53660016/10373009). Just clarifying: does this work with session cookies? I'm trying to scrape a login-locked page using my own session cookie – kryz Sep 01 '21 at 16:57
  • There could be a few things tied up in that, the target server could act differently due to the user agent or the cookie could be mapped to IP on the target server, etc. Can you make the request with curl from the command line? If so, try repeating on the server hosting your Laravel site to confirm. As for session cookies, by my understanding it should work, any long lived cookie storage on the client side is dropped due to creating a new jar each request. – msbit Sep 01 '21 at 22:45
  • So after generating a valid cookie using my current session and making the request, the HTML returned is the "You must login to view this page" HTML. Like you said, perhaps it has to do with the website generating a new cookie every time the page is refreshed or requested. Anyway, I managed to use HttpBrowser to manually fill and submit the form instead of using a session cookie – kryz Sep 02 '21 at 02:21
  • Sounds like that may be the most robust way going forward. It would be interesting to see whether that cookie, added to the cookie jar when logging in via `HttpBrowser`, works in subsequent runs. – msbit Sep 03 '21 at 01:20
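For reference, the form-submission approach kryz ended up with can look roughly like this with HttpBrowser. This is a sketch only: the login URL, button label, and field names are assumptions, and $browser is the instance built in the answer above.

// log in by submitting the form; HttpBrowser's own cookie jar keeps the session
$crawler = $browser->request('GET', 'https://www.example.com/login');
$form = $crawler->selectButton('Log in')->form();
$browser->submit($form, [
    'username' => 'user',   // hypothetical field names
    'password' => 'secret',
]);

// subsequent requests reuse the session cookie stored in the jar
$crawler = $browser->request('GET', 'https://www.example.com/protected');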