I'm dealing with a full HTML document, and I need to extract the URLs, but only those that match a required domain.

<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>

From that junk I need to get the full URL for example.com, but not the others (notexample.com). That would be "http://example.com/foo/bar" or, even better, only the last part of that URL (bar), which of course will be different each time.

Hope I've been clear enough, thanks a lot!

Edit: I'm using PHP.

Chriszuma
monxas
  • You'll at least have to specify the language. Besides, I don't think regex is the easiest solution; try checking whether the string simply contains "example.com", which many languages support. – MarioDS Apr 19 '12 at 13:44
  • Never parse HTML with regular expressions. I'll refer you to [this beautiful answer](http://stackoverflow.com/a/1732454/236660) for details. – Dmytro Shevchenko Apr 19 '12 at 13:46

1 Answer

Avoid using regex to parse HTML like this. Here is code based on a DOM parser that gets what you need; the regex is applied only to each extracted href value, not to the markup:

$html = <<< EOF
<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a[@href]"); // gets all the links that have an href
foreach ($nodelist as $node) {
    $val = $node->getAttribute('href');
    // match only example.com and capture whatever follows /foo/
    if (preg_match('#^https?://example\.com/foo/(.*)$#', $val, $m)) {
        echo "$m[1]\n"; // prints bar
    }
}
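
The pattern above assumes every wanted link starts with /foo/. If the path prefix can vary, one possible variation is to compare the host exactly and take the last path segment with parse_url() and basename(). This is only a sketch: the helper name lastSegmentsForHost is hypothetical and not part of the code above, and it assumes the hrefs are absolute URLs.

```php
<?php
// Hypothetical helper (an illustration, not the original answer's code):
// returns the last path segment of every link whose host is exactly $host.
function lastSegmentsForHost(string $html, string $host): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // suppress warnings from malformed HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $segments = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $linkHost = parse_url($href, PHP_URL_HOST);
        $path = parse_url($href, PHP_URL_PATH);
        // Skip relative, empty, or malformed hrefs (no host or no path).
        if (!is_string($linkHost) || !is_string($path)) {
            continue;
        }
        // Exact host comparison, so "notexample.com" never matches "example.com".
        if (strcasecmp($linkHost, $host) !== 0) {
            continue;
        }
        $segment = basename($path); // "/foo/bar" gives "bar"
        if ($segment !== '') {
            $segments[] = $segment;
        }
    }
    return $segments;
}

$html = '<html><div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div></html>';

echo implode("\n", lastSegmentsForHost($html, 'example.com')), "\n"; // prints bar
```

Comparing the parsed host rather than searching the whole string for "example.com" is what keeps notexample.com out of the result.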
anubhava