Possible Duplicate:
Grabbing the href attribute of an A element

I'm trying to match the following in a page's source:

 <a href="/download/blahbal.html">

I looked at another question on this site and used this regex:

   '/<a href=["\']?(\/download\/[^"\'\s>]+)["\'\s>]?/i'

This returns all the href links on the page, but it cuts the .html off some of them.

Any help would be greatly appreciated.

Thank you

1 Answer

First use the method described here to retrieve all the hrefs; then you can use a regex or strpos to filter out those that don't start with /download/.
The reason to use a parser instead of a regex is discussed in many other posts on Stack Overflow (see this). Once you have parsed the document and collected the hrefs, you can filter them with simple functions.

A little code:

$dom = new DOMDocument;
// $html contains your HTML as a string
$dom->loadHTML($html);
//at the end of the procedure this will be populated with filtered hrefs
$hrefs = array();
foreach( $dom->getElementsByTagName('a') as $node ) {
    //look for href attribute
    if( $node->hasAttribute( 'href' ) ) {
        $href = $node->getAttribute( 'href' );
        // filter out hrefs which don't start with /download/
        if( strpos( $href, "/download/" ) === 0 )
            $hrefs[] = $href; // store href
    }
}
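
If you would rather filter with a regex than with strpos, the same loop can use preg_match. The following is only a minimal sketch: the sample markup in $html is made up for illustration, and the libxml_use_internal_errors call is an optional addition (not part of the code above) that keeps libxml from emitting warnings on imperfect real-world HTML.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<p><a href="/download/blahbal.html">one</a> '
      . '<a href="/about.html">two</a> '
      . "<a class='x' href='/download/other.html'>three</a></p>";

// Optional: collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = array();
foreach ($dom->getElementsByTagName('a') as $node) {
    // getAttribute returns an empty string when the attribute is absent
    $href = $node->getAttribute('href');
    // keep only hrefs that begin with /download/ (regex equivalent of the strpos check)
    if (preg_match('~^/download/~', $href)) {
        $hrefs[] = $href;
    }
}

print_r($hrefs);
```

With this sample input, $hrefs should end up holding the two /download/ links, regardless of the quoting style or attribute order used in the markup.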
CaNNaDaRk
  • Tested, works. strpos is easily substituted with a regex (preg_match) if necessary. – CaNNaDaRk Sep 01 '11 at 10:25
  • Thank you. I'm still curious whether you can do it with a regex, though. – Jamesmiller Sep 01 '11 at 17:57
  • It depends on which links are missing from the match; maybe the regex just has to be adjusted a little. – CaNNaDaRk Sep 01 '11 at 20:04
  • Yeah man, I adjusted it and got it to work, thank you :-) Just facing another problem now, but I'll get there. – Jamesmiller Sep 01 '11 at 20:40
  • I'm glad, but remember that regexes tend to fail on some "unusual" href attributes where DOM doesn't! – CaNNaDaRk Sep 01 '11 at 20:43
  • Ah right, thank you. Do you know which method is faster? – Jamesmiller Sep 01 '11 at 21:06
  • I've never tested the speed of the two methods. I'd bet the regex can be faster, but I can't swear to it; this might be a good topic for a new post if there isn't one already. Usually the performance difference is not big enough to justify the loss of "precision" or functionality. – CaNNaDaRk Sep 01 '11 at 22:08