0

I'm writing a script to get all the links from a website, but I only want the links that contain a specific word. With the following script I can get all the links, but I don't know how to write a regex that searches for the word I want:

$url = file_get_contents("http://www.example.es");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $url, $todosenlaces);
Marcin Nabiałek
  • this is almost impossible to do correctly with regex - http://stackoverflow.com/questions/4702987/php-string-manipulation-extract-hrefs – birdspider Jul 31 '14 at 10:19
  • Where should this word be? In the anchor text or in the URL? – Marcin Nabiałek Jul 31 '14 at 10:22
  • I would advise you to use a library to do the heavy lifting. In this case you can go for the [symfony DomCrawler component](http://symfony.com/doc/current/components/dom_crawler.html) + [symfony CssSelector component](http://symfony.com/doc/current/components/css_selector.html). They are meant to work together, so you can use jQuery-like selectors in PHP; you just need to feed the DomCrawler with the string from the web page. – mTorres Jul 31 '14 at 10:25

2 Answers

2

If you mean that the specific word should appear in the anchor text, you can use:

/<a.+href=["'](.*)["'].*>(.*(?:test|aa).*)<\/a>/isU

The example above finds all anchors that have the word test or aa in their anchor text.
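As a rough illustration of how such a pattern could be used from PHP, here is a minimal sketch. The sample markup is made up for the example, and the pattern is a slightly stricter variant of the one above (it uses `[^<]*` instead of `.*` for the anchor text), not the exact expression from this answer. PHP's `preg_match_all()` already returns every match, so no `g` flag is needed (PHP rejects that flag).

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<a href="/one">first test link</a> '
      . '<a href="/two">other</a> '
      . '<a href="/three">aa page</a>';

// Group 1 captures the href, group 2 the anchor text.
// [^<]* keeps the text match inside a single anchor (no nested tags allowed).
$pattern = '/<a[^>]+href=["\']([^"\']*)["\'][^>]*>([^<]*(?:test|aa)[^<]*)<\/a>/is';

// PREG_SET_ORDER gives one array per match: [full match, href, text].
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'text' => $m[2]];
}

print_r($links);
```

With this sample input only the anchors whose text contains test or aa are collected; the "other" link is skipped.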

If you want only anchors with the specific word inside the URL (the href attribute), you could use:

/<a[^>]+href=["']([^>]*(?:test|aa)[^>]*)["'][^>]*>(.*)<\/a>/isU

These patterns won't work in every case, but they should be good enough for simple matching.
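To make that limitation concrete, here is a small sketch of one input where the anchor-text pattern goes wrong. The sample markup is invented for the example, and the pattern is the first one from this answer written for PHP (without the `g` and `m` flags). Because `.*` may cross tag boundaries, the match starts at the first anchor and runs on into the second one.

```php
<?php
// Hypothetical input: only the second link has "test" in its text.
$html = '<a href="/one">hello</a> <a href="/two">test</a>';

// Anchor-text pattern; "U" makes the quantifiers lazy, "s" lets "." match newlines.
$pattern = '/<a.+href=["\'](.*)["\'].*>(.*(?:test|aa).*)<\/a>/isU';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured hrefs, $matches[2] the captured "anchor text".
var_dump($matches[1]);
var_dump($matches[2]);
```

Here the captured href is /one, even though that link's own text is "hello", and the captured text spans both anchors. A DOM parser avoids this kind of false positive.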

Marcin Nabiałek
0

Do something like this:

$html = file_get_contents("http://www.example.es");

// Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

$results = array();

$tags = $dom->getElementsByTagName('a');
foreach ($tags as $tag) {
    $url = $tag->getAttribute('href');
    // "apple" is the word to search for: look in the URL or in the
    // hyperlink text, and add each link only once
    if (strpos($url, "apple") !== false || strpos($tag->nodeValue, "apple") !== false) {
        $results[] = $url;
    }
}

$results will be an array of all URLs whose href or link text contains the word apple.

As birdspider already pointed out, it's not a good idea to search for links using a regex. The code that parses the document comes from: PHP String Manipulation: Extract hrefs.
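If you prefer to let the parser do the filtering as well, the same check can be sketched with `DOMXPath` and the XPath `contains()` function. This is only an illustration under a few assumptions: the sample markup and the `$word` variable are made up, the match is case-sensitive, and the word must not contain a double quote because it is placed directly into the query string.

```php
<?php
// Hypothetical sample page standing in for the downloaded HTML.
$html = '<html><body>'
      . '<a href="/apple-pie">Pie</a>'
      . '<a href="/pear">An apple a day</a>'
      . '<a href="/plum">Plum</a>'
      . '</body></html>';

// Collect parser warnings quietly instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$word  = 'apple'; // the word to search for (assumed to contain no double quotes)
$xpath = new DOMXPath($dom);

// Select anchors whose href or text content contains the word.
$query = sprintf('//a[contains(@href, "%1$s") or contains(., "%1$s")]', $word);

$results = [];
foreach ($xpath->query($query) as $a) {
    $results[] = $a->getAttribute('href');
}

print_r($results);
```

Each anchor is returned at most once, in document order, so a link that matches in both its URL and its text does not show up twice.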

idmean