
I want to scrape a site with the Symfony Panther package within a Laravel application. According to the documentation https://github.com/symfony/panther#a-polymorphic-feline I cannot use the HttpBrowser or HttpClient classes because they do not support JS.

Therefore, I am trying to use the ChromeClient, which uses a local Chrome executable and the chromedriver binary shipped with the Panther package.

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'http://example.com');
dd($crawler->html());

Unfortunately, I only receive the empty default Chrome page as HTML:

<html><head></head><body></body></html>

Every attempt to do anything else with the $client or $crawler instance leads to the error "no nodes available".

Additionally, I tried the basic example from the documentation https://github.com/symfony/panther#basic-usage with the same result.

I'm using Ubuntu 18.04 Server under WSL on Windows and installed the google-chrome-stable deb package. This seemed to work, because after the installation the error "the binary was not found" no longer occurs.

I also tried to manually use the executable of the Windows host system, but this only opens an empty CMD window that reopens whenever I close it; I have to kill the process via Task Manager.

Is this because the Ubuntu server does not have an X server available?
What can I do to receive any HTML?
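(Panther starts Chrome headless by default, so an X server should not be required. A minimal sketch of passing Chrome arguments explicitly; the flag list below is an assumption worth trying under WSL, not a confirmed fix:)

```php
use Symfony\Component\Panther\Client;

// createChromeClient(?string $chromeDriverBinary, ?array $arguments) lets us
// override the default Chrome flags; these are common workarounds for
// sandbox/shared-memory problems under WSL and in containers.
$client = Client::createChromeClient(null, [
    '--headless',
    '--no-sandbox',            // Chrome's sandbox often fails under WSL
    '--disable-dev-shm-usage', // /dev/shm can be too small
    '--window-size=1200,1100',
]);
$crawler = $client->request('GET', 'http://example.com');
dd($crawler->html());
```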

Danaq
  • It has nothing to do with solving this problem, but try investigating this one: https://github.com/spatie/crawler – Odin Thunder May 13 '20 at 12:34
  • Did you check this: https://github.com/puppeteer/puppeteer/blob/master/docs/troubleshooting.md#chrome-headless-doesnt-launch-on-unix ? According to the documentation it is possible to run it in headless mode. You can see the required packages. – emul May 15 '20 at 15:01
  • I have the same issue. When I request a locally hosted website it returns the HTML, but when I try an external webpage I get the same result. Did you find a solution? – kevinabraham Jan 14 '21 at 05:35

2 Answers

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'http://example.com');

// Get the whole HTML of the page
$html = $client->getCrawler()->html();

// For example, filter an element by ID = AuthenticationBlock and get its text
$loginUsername = $client->getCrawler()->filter('#AuthenticationBlock')->text();
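One hedged addition: if the target page builds its DOM with JavaScript, filtering right after request() can fail with "no nodes available". Panther's waitFor() blocks until a selector appears; a sketch reusing the #AuthenticationBlock selector from the example above:

```php
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->request('GET', 'http://example.com');

// Wait until the element exists in the DOM (throws on timeout).
$client->waitFor('#AuthenticationBlock');

$loginUsername = $client->getCrawler()->filter('#AuthenticationBlock')->text();
```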
Dharman
Mmx
  • I had the same problem and tried to find an answer, but no one shared this solution. The readme file does not describe how to use Panther at all. The answer is simple once you know how it works. – Mmx Sep 19 '20 at 11:12
  • I commented the code, is that not enough? Please give me advice. – Mmx Sep 19 '20 at 12:48
  • The more explanation you give the better. In this case, I believe your comments should be enough, but in the future I recommend explaining your solutions so that they are more easily understandable. – Dharman Sep 19 '20 at 12:59

So, I'm probably late, but I had the same problem, with a pretty easy solution: just create a plain crawler from the response content.

This one differs from the Panther DomCrawler, especially in its methods, but it is safer for evaluating HTML structures.

$client = Client::createChromeClient();
$client->request('GET', 'http://example.com');

$html = $client->getInternalResponse()->getContent();
$crawler = new Symfony\Component\DomCrawler\Crawler($html);

// you can use following to get the whole HTML
$crawler->outerHtml();

// or specific parts
$crawler->filter('.some-class')->outerHtml();
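A small usage sketch for the detached crawler; the markup and the .some-class selector here are made up for illustration (filter() additionally requires the symfony/css-selector package):

```php
use Symfony\Component\DomCrawler\Crawler;

// Works on any static HTML string, independent of the browser session.
$html = '<div class="some-class"><a href="/a">First</a><a href="/b">Second</a></div>';
$crawler = new Crawler($html);

// Collect the text of every link inside .some-class
$links = $crawler->filter('.some-class a')->each(
    static fn (Crawler $node) => $node->text()
);
// $links === ['First', 'Second']
```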
Sengorius