Regexp - match tag 'a' without http:// in href

Question

For example, I have these <a> tags:

<a href="http://www.domain.com/products/foo">Foo product</a>
<a href="/articles/bar">Bar article</a>

I use this pattern:

/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

This expression returns both tags (Foo product and Bar article). How can I write an expression that returns only the "Bar article" tag?

Thank you.

EDIT:

@Avinash Raj thank you for the tip.

This pattern works for me:

/^.*<a\s[^>]*href="http:\/\/.*$(*SKIP)(*F)|<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\1[^>]*>(.*?)<\/a>/miU

use a negative lookahead. But really you should be parsing the html — exussum, Jul 28 '14 at 11:28
I’m just going to leave this here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — DNNX, Jul 28 '14 at 11:30
don't parse HTML with regexes - it's not a regular language. Use a special parser instead — user4035, Jul 28 '14 at 11:37
That regular expression is quite complex, which makes it difficult to read and maintain. Please don't ignore the advice against using regular expressions; it's there for a reason! An approach using a parser may be longer, but at least it is fairly self-documenting, which is _much_ more important. Please take a look at my answer and if you have any questions, let me know. — Tom Fenech, Jul 28 '14 at 12:33
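
The negative lookahead suggested in the comments could look roughly like the sketch below. It assumes the href value is always quoted, and it treats both http:// and https:// as absolute (the question only mentions http://). The names $html, $pattern and $links are illustrative. It is still a regex over HTML, so the caveats raised in the comments apply.

```php
<?php
// Sample input taken from the question.
$html = <<<'EOT'
<a href="http://www.domain.com/products/foo">Foo product</a>
<a href="/articles/bar">Bar article</a>
EOT;

// (?!https?://) is the negative lookahead: right after the opening quote it
// rejects any href value that starts with http:// or https://.
// Assumption: the href value is always wrapped in single or double quotes.
$pattern = '~<a\s[^>]*href\s*=\s*(["\'])(?!https?://)([^"\'>]*)\1[^>]*>(.*?)</a>~is';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Map each matching href (group 2) to its link text (group 3).
$links = [];
foreach ($matches as $m) {
    $links[$m[2]] = $m[3];
}

print_r($links);
```

The quote is made mandatory on purpose: with an optional quote the engine can backtrack, match "no quote", and let the lookahead succeed on the quote character itself, which would let the absolute link through.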

score 1 · Accepted Answer · answered Jul 28 '14 at 12:14

Use a DOM parser, such as DOMDocument:

<?php
$site = <<<'EOT'
<a href="http://www.domain.com/products/foo">Foo product</a>
<a href="/articles/bar">Bar article</a>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($site);

$anchors = $doc->getElementsByTagName('a');
foreach ($anchors as $a) {
    $href = $a->getAttribute('href');
    // parse_url() returns null when the URL has no scheme (e.g. a relative path)
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if ($scheme === null) {
        echo $a->textContent;   // output: Bar article
    }
}

The loop visits each <a> element and parses its href with parse_url. If the href has no scheme (such as http), the link text is echoed. What you actually do with the element is, of course, entirely up to you.
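
If you would rather collect the matching links than echo them, the same idea can be wrapped in a small function. This is only a sketch: relativeLinkTexts is a hypothetical helper name, not part of DOMDocument, and it assumes PHP 7 or later with the dom extension enabled. Anchors without an href are skipped, which is a choice made here rather than something the question asks for.

```php
<?php
/**
 * Hypothetical helper: returns the text of every <a> whose href has no scheme.
 */
function relativeLinkTexts(string $html): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Skip anchors without an href; parse_url() yields null when
        // the URL has no scheme, e.g. for a relative path.
        if ($href !== '' && parse_url($href, PHP_URL_SCHEME) === null) {
            $texts[] = $a->textContent;
        }
    }
    return $texts;
}

$html = <<<'EOT'
<a href="http://www.domain.com/products/foo">Foo product</a>
<a href="/articles/bar">Bar article</a>
EOT;

print_r(relativeLinkTexts($html));
```

Because the check looks at the scheme rather than the literal text http://, links using https or any other scheme are excluded as well.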

Ok, this is the best solution. Easy to read and more useful than regexp. Thank you. — kevas, Jul 28 '14 at 13:36

score 0 · Answer 2 · answered Jul 28 '14 at 11:48

You can use

<a href="(.*)<\/a>

with preg_match_all and then get the last result out of the $matches array with

$web =   '<a href="http://www.domain.com/products/foo">Foo product</a>
          <a href="/articles/bar">Bar article</a>';
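preg_match_all("/<a href=\"(.*)<\/a>/", $web, $matches);

// Prints the last matched tag: <a href="/articles/bar">Bar article</a>
// Note: this selects by position only; it does not inspect the href.
print_r($matches[0][count($matches[0]) - 1]);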

But, as others have already pointed out: do not use regex to search through the DOM. Use a DOM parser instead!
