
This is a free-time project meant to cut down on the repetitive clicking I do at my company, so I hope it is neither offensive nor prohibited.

Page to be scraped

Preview of the page

I only want the URI of the second link, because it is the exact search match. The first one also contains the -V1331 suffix.

Wrong:

<a href="http://pdb2.turck.de/en/DE/products/0000000000011ba40002003a">
    <strong> Product&nbsp;BI1-EH04-AP6X-V1331</strong> (HTML, 48.7K)<br>
    Product&nbsp;<strong>BI1-EH04-AP6X-V1331</strong> 
    Click to enlarge Inductive sensor Order number: &nbsp;4608440 Smooth barrel, Ø 4 mm Stainless steel, 1.4427 SO DC 3-wire, 10…30 VDC NO contact, PNP
</a>

Right:

<a href="http://pdb2.turck.de/en/DE/products/000000000001ecee0003003a">
    <strong> Product&nbsp;BI1-EH04-AP6X</strong> (HTML, 48.6K)<br>
    Product&nbsp;<strong>BI1-EH04-AP6X</strong> 
    Click to enlarge Inductive sensor Order number: &nbsp;4609540 Smooth barrel, Ø 4 mm Stainless steel, 1.4427 SO DC 3-wire, 10…30 VDC NO contact, PNP output
</a>

I have tried this:

$search = 'BI1-EH04-AP6X';
$crawler = Goutte::request('GET', 'http://www.turck.de/en/search.php?q_simple=' . $search);
return $crawler->selectLink(' Product&nbsp;' . $search)->link()->getUri();

However, this fails: the <a> element contains a lot of nested HTML, so no link is matched.

Do not be confused by Laravel's Goutte facade; the methods called here are those of the Symfony DomCrawler.

How can I obtain the URI of the second link? Is there a method that matches a link if it contains a given HTML snippet (in our case > Product&nbsp;BI1-EH04-AP6X<)?

peter.babic

1 Answer


I found the answer by experimenting with the XPath Helper extension and the information from the SO question mentioned below.

Locating the node by value containing whitespaces using XPath

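$search = 'BI1-EH04-AP6X';
$crawler = Goutte::request('GET', 'http://www.turck.de/en/search.php?q_simple=' . $search);
// Match only <strong> elements whose whitespace-normalized text equals the search term exactly,
// so the -V1331 variant is skipped.
$crawler->filterXPath('//strong[normalize-space(text())="' . $search . '"]')->each(function ($node) {
    // The nearest ancestor of the <strong> is the enclosing <a>; print its URI.
    print $node->parents()->link()->getUri() . "\n";
});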

It still needs some optimization, but for now it is all right.
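One possible optimization is to select the <a> element directly instead of matching the <strong> and walking up to its parent. The following is only a minimal sketch of that idea, using PHP's built-in DOMDocument and DOMXPath rather than Goutte so that it runs without a network request. The function name findExactProductLinks and the trimmed sample markup are hypothetical, and the sketch assumes the search term contains no double quotes.

```php
<?php
// Hypothetical sample markup, trimmed from the two links shown in the question.
$html = <<<'HTML'
<html><body>
<a href="http://pdb2.turck.de/en/DE/products/0000000000011ba40002003a">
    <strong> Product&nbsp;BI1-EH04-AP6X-V1331</strong> (HTML, 48.7K)<br>
    Product&nbsp;<strong>BI1-EH04-AP6X-V1331</strong>
</a>
<a href="http://pdb2.turck.de/en/DE/products/000000000001ecee0003003a">
    <strong> Product&nbsp;BI1-EH04-AP6X</strong> (HTML, 48.6K)<br>
    Product&nbsp;<strong>BI1-EH04-AP6X</strong>
</a>
</body></html>
HTML;

// Hypothetical helper: returns the href of every <a> that has a child
// <strong> whose whitespace-normalized text equals $search exactly.
function findExactProductLinks(string $html, string $search): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumption: $search contains no double quotes, so it can be embedded as-is.
    $query = sprintf('//a[strong[normalize-space(.) = "%s"]]/@href', $search);

    $uris = [];
    foreach ($xpath->query($query) as $attr) {
        $uris[] = $attr->nodeValue;
    }
    return $uris;
}

$uris = findExactProductLinks($html, 'BI1-EH04-AP6X');
print implode("\n", $uris) . "\n";
```

On the sample markup this prints only the second link's URI, since neither <strong> in the first link normalizes to exactly the search term. With the crawler from the answer, the same expression without the trailing /@href could presumably be passed to filterXPath(), followed by link()->getUri(), but I have not tested that against the live page.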

peter.babic