I need to capture all links in a given HTML document.

Here is a sample:

<div class="infobar">
    ... some code goes here ...
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
    ... some code goes here ...
</div>

I need to get all links inside div.infobar that start with /link/.

I tried this:

preg_match_all('#<div class="infobar">.*?(href="/link/(.*?)") .*?</div>#is', $raw, $x);

but it gives me only the first match.

Thanks for any advice.

Valour
  • Maybe there's an html parser that will do this more easily for you? –  Jun 23 '11 at 23:33
  • I am already doing it by first getting the contents of div.infobar with preg_match and then getting the links with preg_match_all. But since regex offers more flexibility, why shouldn't I use it? I just need a good pattern. I want to know how to accomplish this with just one preg_match_all. – Valour Jun 23 '11 at 23:35
  • You cannot do that with a single regex. You first need to isolate the div and then extract the desired links from it. What the stubby comments are about: you can extract the links more easily with phpQuery or [QueryPath](http://querypath.org/) using `foreach (qp($html)->find("div.infobar a") as $a) { print $a->attr("href"); }`. Using a specific regex is really only appropriate for performance reasons, and only if the input is a known, coherent HTML blob. – mario Jun 23 '11 at 23:35
  • HTML is not a regular language, so it is [unwise to use a regular expression to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – sarnold Jun 24 '11 at 00:28
  • @stereofrog, fair point; there's no way I can improve upon [anubhava's answer](http://stackoverflow.com/questions/6461732/using-a-regular-expression-to-extract-urls-from-links-in-an-html-document/6461935#6461935) for this specific case, and I think a little levity is a fantastic way to show that trying to use the wrong tool for the job can lead to incredible frustration. – sarnold Jun 24 '11 at 01:51
  • @Stereofrog, [our very own Jeff Atwood](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) provides further advice that parsing a non-regular language such as HTML with a regular expression might just work most of the time, but is brittle. Yes, [newer engines called 'regular expression'](http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt) can match some non-regular languages, but I still believe a match written in one of these languages will be harder to maintain over time than using a more powerful parser such as the `DOMDocument` or XPath approaches. – sarnold Jun 24 '11 at 09:11

4 Answers

I would suggest using DOMDocument for this rather than regex. Consider the following simple code:

$content = '
<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content);

// To hold all your links...
$links = array();

// Get all divs
$divs = $dom->getElementsByTagName("div");
foreach($divs as $div) {
  // Check the class attr of each div
  $cl = $div->getAttribute("class");
  if ($cl == "infobar") {
    // Collect the href of every <a> inside this div that starts with /link/
    $anchors = $div->getElementsByTagName("a");
    foreach ($anchors as $a) {
      $url = $a->getAttribute("href");
      if (strpos($url, "/link/") === 0) {
        $links[] = $url;
      }
    }
  }
}
var_dump($links);

OUTPUT

array(4) {
  [0]=>
  string(15) "/link/some-text"
  [1]=>
  string(18) "/link/another-text"
  [2]=>
  string(12) "/link/blabla"
  [3]=>
  string(13) "/link/whassup"
}
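
The same selection can also be written as a single XPath query. What follows is only a sketch, not part of the original answer: the extra `wide` class, the second div, and the `/other/page` link are made up to show that the query matches `infobar` as one class among several and drops links that do not start with `/link/`.

```php
<?php
// Sample markup; the extra class, the second div, and the /other/ link are hypothetical.
$content = '
<div class="infobar wide">
    <a href="/link/some-text">link 1</a>
    <a href="/other/page">skipped</a>
    <a href="/link/whassup">link 4</a>
</div>
<div class="footer"><a href="/link/outside">not in infobar</a></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "infobar" as a whole class name, then keep only hrefs starting with /link/.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " infobar ")]'
       . '//a[starts-with(@href, "/link/")]';

$links = array();
foreach ($xpath->query($query) as $a) {
    $links[] = $a->getAttribute("href");
}
print_r($links);
```

With this sample input, `$links` should end up holding `/link/some-text` and `/link/whassup`.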
anubhava
  • Let's see if the OP still thinks regexes are better :D – dynamic Jun 23 '11 at 23:50
  • How does the execution time of this compare with regex? I can do this with just two preg_match_all calls. – Valour Jun 27 '11 at 12:11
  • Execution time should be comparable to (or even better than) regex-based code, but more importantly, DOM-based code will NOT break unexpectedly the way regex code can. – anubhava Jun 27 '11 at 12:25

Revising my previous answer. You'll need to do it in two steps:

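// First step: grab the contents of the div.
// Note: this stops at the first </div>, so it assumes no nested divs inside div.infobar.
if (preg_match('#(?<=<div class="infobar">).*?(?=</div>)#is', $raw, $div)) {
    // Second step: grab all of the links; $links[1] holds the part after /link/.
    preg_match_all('#href="/link/(.*?)"#is', $div[0], $links);
}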
Jacob Eggers

You could use Simple HTML DOM (http://simplehtmldom.sourceforge.net/):

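// Requires the Simple HTML DOM library file
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all links (use find('div.infobar a') to limit the search to div.infobar)
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}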
iHaveacomputer

Try this (I added a +):

preg_match_all('#<div class="infobar">.*?(href="/link/(?:.*?)")+ .*?</div>#is', $raw, $x);
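
Note that a quantifier on a group only matches repetitions that sit directly next to each other, and the links here are separated by other markup. If a single `preg_match_all` call is really wanted, one possible sketch uses PCRE's `\G` anchor, which continues from where the previous match ended. This is an illustration rather than a recommendation: it assumes the opening tag is written exactly as `<div class="infobar">` and that div.infobar contains no nested divs, and the second div in the sample is made up to show that links outside the infobar are skipped.

```php
<?php
// Sample input; the second div is hypothetical.
$raw = '
<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
</div>
<div class="other"><a href="/link/outside">x</a></div>';

// Either start at the opening tag, or continue where the previous match ended (\G).
// (?!\A) stops \G from matching at the very start of the subject.
// (?:(?!</div>).)*? advances without ever crossing a closing </div>.
$pattern = '#(?:<div class="infobar">|\G(?!\A))(?:(?!</div>).)*?href="/link/([^"]*)"#is';

preg_match_all($pattern, $raw, $m);
print_r($m[1]); // the captured parts after /link/
```

Here `$m[1]` should contain `some-text` and `another-text`. A pattern like this is fragile against small changes in the markup, which is the usual argument for the DOM-based approaches.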
agent-j