Using a regular expression to extract URLs from links in an HTML document

Question

I need to capture all links in a given HTML document.

Here is sample code:

<div class="infobar">
    ... some code goes here ...
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
    ... some code goes here ...
</div>

I need to get all links inside div.infobar that start with /link/

I tried this:

preg_match_all('#<div class="infobar">.*?(href="/link/(.*?)") .*?</div>#is', $raw, $x);

but it gives me only the first match.

Thanks for any advice.

Maybe there's an html parser that will do this more easily for you? — , Jun 23 '11 at 23:33
I already have it working: first I get the inside of div.infobar with preg_match, then I get the links with preg_match_all. But since regex offers more flexibility, why shouldn't I use it? I just need a good pattern. I want to know how to accomplish that with just 1 preg_match_all — Valour, Jun 23 '11 at 23:35
You cannot do that with a single regex. You first need to isolate the div and then extract the desired links from it. -- What the stubby comments are about: you can extract the links more easily with phpQuery or [QueryPath](http://querypath.org/) using `foreach (qp($html)->find("div.infobar a") as $a) { print $a->attr("href"); }` Using a specific regex is really only appropriate for performance reasons, and only if the input is a known, consistent HTML blob. — mario, Jun 23 '11 at 23:35
HTML is not a regular language, so it is [unwise to use a regular expression to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — sarnold, Jun 24 '11 at 00:28
@stereofrog, fair point; there's no way I can improve upon [anubhava's answer](http://stackoverflow.com/questions/6461732/using-a-regular-expression-to-extract-urls-from-links-in-an-html-document/6461935#6461935) for this specific case, and I think a little levity is a fantastic way to show that trying to use the wrong tool for the job can lead to incredible frustration. — sarnold, Jun 24 '11 at 01:51
@Stereofrog, [our very own Jeff Atwood](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) provides further advice that parsing a non-regular language such as HTML with a regular expression might just work most of the time, but is brittle. Yes, [newer engines called 'regular expression'](http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt) can match some non-regular languages, but I still believe a match written in one of these languages will be harder to maintain over time than using a more powerful parser such as the `DOMDocument` or XPath approaches. — sarnold, Jun 24 '11 at 09:11

score 7 · Accepted Answer · answered Jun 23 '11 at 23:44

I would suggest using DOMDocument for this rather than regex. Consider the following simple code:

$content = '
<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content);

// To hold all your links...
$links = array();

// Get all divs
$divs = $dom->getElementsByTagName("div");
foreach($divs as $div) {
  // Check the class attr of each div
  $cl = $div->getAttribute("class");
  if ($cl == "infobar") {
    // Collect the href attribute of every <a> inside this div
    $hrefs = $div->getElementsByTagName("a");
    foreach ($hrefs as $href) {
       $links[] = $href->getAttribute("href");
    }
  }
}
var_dump($links);

OUTPUT

array(4) {
  [0]=>
  string(15) "/link/some-text"
  [1]=>
  string(18) "/link/another-text"
  [2]=>
  string(12) "/link/blabla"
  [3]=>
  string(13) "/link/whassup"
}
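
If only the hrefs that start with /link/ are wanted, the XPath approach mentioned in the comments can do that filtering inside the query. The following is a minimal sketch and not part of the original answer: the `/other/page` link and the second `footer` div in `$content` are made up for illustration, and the class test assumes `infobar` may appear among several class names.

```php
<?php
// Sample markup; the /other/page link and the footer div are hypothetical additions.
$content = '
<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/other/page">not wanted</a>
    <a href="/link/blabla">link 3</a>
</div>
<div class="footer">
    <a href="/link/outside">outside the infobar</a>
</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select the href attributes of <a> elements under any div whose class list
// contains "infobar", keeping only values that begin with /link/.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " infobar ")]'
       . '//a[starts-with(@href, "/link/")]/@href';

$links = array();
foreach ($xpath->query($query) as $href) {
    $links[] = $href->nodeValue;
}
var_dump($links);
```

With this sample, `$links` should contain only `/link/some-text` and `/link/blabla`; the link in the footer div and the `/other/page` link are skipped.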

What is the execution time of this compared to regex? I can do this with just 2 preg_match_all calls. — Valour, Jun 27 '11 at 12:11
Execution time will be comparable to (or even better than) regex-based code, but more importantly, DOM-based code will NOT break unexpectedly the way regex code can. — anubhava, Jun 27 '11 at 12:25

score 2 · Answer 2 · answered Jun 23 '11 at 23:26

Revising my previous answer. You'll need to do it in two steps:

//This first step grabs the contents of the div.
if (preg_match('#(?<=<div class="infobar">).*?(?=</div>)#is', $raw, $div)) {
    //And here, we grab all of the links; the captured paths end up in $links[1].
    preg_match_all('#href="/link/(.*?)"#is', $div[0], $links);
}

answered Jun 23 '11 at 23:26 · edited Jun 24 '11 at 00:22 · Jacob Eggers

Thanks, but this time it gets only the last one :D – Valour Jun 23 '11 at 23:29
I split it into two steps. The div gets matched the first time, and then can't be matched again. – Jacob Eggers Jun 24 '11 at 00:24
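
To see why one pattern spanning the whole div yields a single match, it may help to run both versions side by side. This is only a sketch under assumptions: `$raw` holds the sample from the question, and the one-pattern version is simplified from the question's pattern (it drops the space after the closing quote, which the sample markup does not contain).

```php
<?php
// Sample input taken from the question.
$raw = '<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
</div>';

// One pattern spanning the div: the first match consumes everything up to
// </div>, so the search has nothing left to match a second time.
$single = preg_match_all(
    '#<div class="infobar">.*?href="/link/(.*?)".*?</div>#is',
    $raw,
    $one
);

// Two steps: isolate the div contents, then collect every link inside them.
$links = array(1 => array());
if (preg_match('#(?<=<div class="infobar">).*?(?=</div>)#is', $raw, $div)) {
    preg_match_all('#href="/link/(.*?)"#is', $div[0], $links);
}

var_dump($single);   // number of full matches found by the single pattern
var_dump($one[1]);   // captures from the single pattern
var_dump($links[1]); // captures from the two-step version
```

The single pattern should report one match with only `some-text` captured, while the two-step version captures all four paths. Note that the lazy `.*?(?=</div>)` stops at the first `</div>`, so a div nested inside the infobar would cut the first step short.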

score 2 · Answer 3 · answered Jun 24 '11 at 02:04

Using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/):

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

answered Jun 24 '11 at 02:04 · iHaveacomputer

score 0 · Answer 4 · answered Jun 23 '11 at 23:21

Try this (I added a +):

preg_match_all('#<div class="infobar">.*?(href="/link/(?:.*?)")+ .*?</div>#is', $raw, $x);

answered Jun 23 '11 at 23:21 · agent-j
