regex to get part of url from html string

Question

I'm dealing with a full HTML document, and I need to extract the URLs, but only those that match a required domain:

<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>

From that junk I need to get the full URL for example.com, but not the others (notexample.com). That would be "http://example.com/foo/bar" or, even better, only the last part of that URL (bar), which of course will be different each time.

Hope I've been clear enough, thanks a lot!

Edit: using PHP

You'll at least have to specify the language. Besides, I don't think regex is the easiest solution; try checking whether the string simply contains "example.com", which a lot of languages support. — MarioDS, Apr 19 '12 at 13:44
Never parse html with regular expressions. I'll refer you to [this beautiful answer](http://stackoverflow.com/a/1732454/236660) for details. — Dmytro Shevchenko, Apr 19 '12 at 13:46

score 1 · Accepted Answer · answered Apr 19 '12 at 14:24

You should avoid regex for parsing HTML like this. Here is DOM-parser-based code that will get what you need; the regex is applied only to each extracted href value, not to the markup:

$html = <<< EOF
<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a[@href]"); // gets all the links that have an href
foreach ($nodelist as $node) {
    $val = $node->getAttribute('href');
    // keep only example.com links and capture whatever follows /foo/
    if (preg_match('#^https?://example\.com/foo/(.*)$#', $val, $m)) {
        echo "$m[1]\n"; // prints bar
    }
}
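
If the path in front of the last part is not always /foo/, one possible variation is to drop the regex altogether and let parse_url() split each href, then compare the host and take the last path segment with basename(). This is only a sketch, not part of the answer above: the helper name lastSegmentsForHost is made up for illustration, and it assumes you want an exact, case-insensitive host match (so www.example.com would not count).

```php
<?php
// Hypothetical helper (name invented for this sketch): returns the last path
// segment of every link in $html whose host is exactly $host.
function lastSegmentsForHost(string $html, string $host): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $segments = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $parts = parse_url($a->getAttribute('href'));
        if ($parts === false || !isset($parts['host'], $parts['path'])) {
            continue; // relative or unparsable URL, nothing to compare
        }
        if (strcasecmp($parts['host'], $host) !== 0) {
            continue; // some other domain, e.g. notexample.com
        }
        $segment = basename($parts['path']); // "/foo/bar" gives "bar"
        if ($segment !== '') {
            $segments[] = $segment;
        }
    }
    return $segments;
}

$html = '<html><div>junk <a href="http://example.com/foo/bar">example.com</a> '
      . 'morejunk <a href="http://notexample.com/foo/bar">notexample.com</a></div></html>';

print_r(lastSegmentsForHost($html, 'example.com')); // array with the single element "bar"
```

Comparing the parsed host instead of matching on the raw string also means that notexample.com cannot slip through just because it contains "example.com".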
