Getting the href of an <a> tag with preg_match_all and cURL

Question

I've been trying for a couple of days to solve this problem. I use cURL to get the content of a webpage and then use preg_match_all to pull out the content and present it in my own style, but I run into a problem when it's time to find certain <a> tags in the document.

I want preg_match_all to find all <a> tags that are immediately followed by a <strong> tag and then store the href values of those <a> tags in an array variable.

Here's what I've tried:

preg_match_all("~(<a href=\"(.*)\"><strong>\w+<\/strong>)~iU", $result, $link);

It returns:

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )

Can somebody help me, please?

http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php — Paul Dessert, Sep 26 '13 at 22:05
http://php.net/preg_match_all make sure that $result and $link are the right way around... Other than that, we'd need to see some example `html` to write a `regex`... — Steven, Sep 26 '13 at 22:19

Kenny · Answer 1 · 2013-09-30T19:23:59.137

I strongly recommend you go with DOMDocument.

This code should do the trick...

<?php

/**
* @author Jay Gilford
* @edited KHMKShore:stackoverflow
*/

/**
* get_links()
* 
* @param string $url
* @return array
*/
function get_links($url) {

  // Create a new DOM Document to hold our webpage structure
  $xml = new DOMDocument();

  // Load the url's contents into the DOM (the @ suppresses any errors from invalid HTML)
  @$xml->loadHTMLFile($url);

  // Empty array to hold all links to return
  $links = array();

  // Loop through each <a> element in the DOM
  foreach($xml->getElementsByTagName('a') as $link) {
    // If it contains a <strong> element, save the href link.
    // ->length works on every PHP version; count() on a DOMNodeList needs PHP 7.2+.
    if ($link->getElementsByTagName('strong')->length > 0) {
        $links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
    }
  }

  //Return the links
  return $links;
}
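
Since the question already downloads the page with cURL, here is a minimal sketch of the same idea applied to an HTML string instead of a URL. The function name get_strong_links_from_html, the sample markup, and the placeholder URL are assumptions made for illustration, not part of the original answer. It uses DOMXPath to select anchors that contain a <strong>, and libxml_use_internal_errors() instead of the @ operator.

```php
<?php

/**
 * Hypothetical helper (not from the original answer): return the href of
 * every <a> that contains a <strong>, given HTML that is already in memory.
 */
function get_strong_links_from_html(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    $xpath = new DOMXPath($doc);

    // //a[.//strong] selects anchors that have a <strong> descendant.
    foreach ($xpath->query('//a[.//strong]') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }

    return $hrefs;
}

// Fetching the page with cURL, as in the question (the URL is a placeholder):
// $ch = curl_init('http://example.com/');
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $result = curl_exec($ch);
// curl_close($ch);

// Sample markup standing in for the cURL response.
$result = '<p><a href="one.php"><strong>One</strong></a> <a href="two.php">Two</a></p>';
print_r(get_strong_links_from_html($result));
```

With the sample markup, only one.php is returned, because the second link has no <strong> inside it.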

Thank you so much KHMKShore ! You were right, DOMDocument was the right thing to do ! Thank you so much ! — Ariarteau, Sep 30 '13 at 03:35

score 0 · Answer 2 · answered Sep 26 '13 at 22:56

Firstly, your regex can fail easily, for example on a tag like this:

<a alt="cow > moo" href="cow.php"><strong>moo</strong></a>
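
To see the failure concretely, the sketch below runs the pattern from the question against that tag and then reads the same attribute with DOMDocument. The variable names are illustrative only.

```php
<?php

// The tag from the example above: href is not the first attribute, and
// the alt value contains a ">" character.
$html = '<a alt="cow > moo" href="cow.php"><strong>moo</strong></a>';

// The pattern from the question. It requires href to come directly after "<a ".
$count = preg_match_all('~(<a href="(.*)"><strong>\w+<\/strong>)~iU', $html, $matches);
var_dump($count); // int(0): the regex finds nothing

// DOMDocument parses attributes properly, so their order does not matter.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$a = $doc->getElementsByTagName('a')->item(0);
echo $a->getAttribute('href'), "\n"; // cow.php
```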

Second, your regex is slightly off; the following will work:

~(<a href="(.*)"><strong>\w+</strong></a>)~
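
One caveat worth checking with this pattern: without the U flag, (.*) is greedy, so it can swallow several links that sit on the same line. The sketch below compares it with an assumed variant that uses [^"]* for the href value; the variant and the sample markup are illustrations, not part of the answer.

```php
<?php

// Two matching links on a single line.
$html = '<a href="a.php"><strong>x</strong></a> <a href="b.php"><strong>y</strong></a>';

// Pattern from the answer: the greedy (.*) runs from the first href to the last link.
preg_match_all('~(<a href="(.*)"><strong>\w+</strong></a>)~', $html, $greedy);

// Assumed variant: [^"]* stops at the closing quote of each href value.
preg_match_all('~<a href="([^"]*)"><strong>\w+</strong></a>~i', $html, $strict);

print_r($greedy[2]); // a single entry spanning both links
print_r($strict[1]); // a.php and b.php as separate entries
```

Both patterns still depend on href being the only attribute and on the link text being a single word, so they remain fragile on real pages.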

Thirdly, and most importantly, if you want extraction that does not break on variations in the markup, then as @KHMKShore has pointed out, DOMDocument is the best path.
