
For example, I have these <a> tags:

<a href="http://www.domain.com/products/foo">Foo product</a>
<a href="/articles/bar">Bar article</a>

I use this pattern:

/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

This expression matches both tags (Foo product and Bar article). How can I write an expression that matches only the "Bar article" tag, i.e. the one whose href does not start with http://?

Thank you.

EDIT:

@Avinash Raj thank you for the tip.

This pattern works for me:

/^.*<a\s[^>]*href="http:\/\/.*$(*SKIP)(*F)|<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\1[^>]*>(.*?)<\/a>/miU
kevas
  • use a negative lookahead. But really you should be parsing the html – exussum Jul 28 '14 at 11:28
  • I’m just going to leave this here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – DNNX Jul 28 '14 at 11:30
  • @user What would be the expected output? – Avinash Raj Jul 28 '14 at 11:37
  • don't parse HTML with regexes - it's not a regular language. Use a special parser instead – user4035 Jul 28 '14 at 11:37
  • @user3468684 see http://regex101.com/r/sQ0kW4/6 – Avinash Raj Jul 28 '14 at 11:39
  • That regular expression is quite complex, which makes it difficult to read and maintain. Please don't ignore the advice against using regular expressions; it's there for a reason! An approach using a parser may be longer, but at least it is fairly self-documenting, which is _much_ more important. Please take a look at my answer and let me know if you have any questions. – Tom Fenech Jul 28 '14 at 12:33
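
For readers who still want to see the negative lookahead suggested in the comments, here is a minimal sketch. The function name `findRelativeLinks` and the returned array shape are illustrative assumptions, not part of the question. It also assumes double-quoted href values and treats hrefs starting with `//` as absolute. As the comments warn, a regex like this is fragile on real-world HTML.

```php
<?php
// Sample input copied from the question.
$html = <<<'EOT'
<a href="http://www.domain.com/products/foo">Foo product</a>
<a href="/articles/bar">Bar article</a>
EOT;

/**
 * Return href and link text of anchors whose href does not start with
 * "http:", "https:" or "//". (Hypothetical helper for illustration.)
 */
function findRelativeLinks(string $html): array
{
    // (?!https?:|//) is the negative lookahead: the match is rejected when
    // the href value begins with a scheme or a protocol-relative prefix.
    $pattern = '~<a\s[^>]*?href\s*=\s*"(?!https?:|//)([^"]*)"[^>]*>(.*?)</a>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $links[] = ['href' => $m[1], 'text' => $m[2]];
    }
    return $links;
}

print_r(findRelativeLinks($html));
```

With the sample input, only the /articles/bar link should be returned.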

2 Answers


Use a DOM parser, such as DOMDocument:

<?php
$site = <<<'EOT'
<a href="http://www.domain.com/products/foo">Foo product</a>
<a href="/articles/bar">Bar article</a>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($site);

$anchors = $doc->getElementsByTagName('a');
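foreach ($anchors as $a) {
    $href = $a->getAttribute('href');
    // parse_url() returns null when the requested component is absent
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if (!isset($scheme)) {
        // no scheme such as "http", so the link is relative
        echo $a->textContent;   // output: Bar article
    }
}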

The loop visits each <a> element and parses its href attribute with parse_url. If the href has no scheme (such as http), the link is relative, so its text content is echoed. Of course, what you actually do with the element is entirely up to you.
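
The same idea can be wrapped in a reusable function. This is only a sketch under some assumptions: the name `relativeLinkTexts` is made up for illustration, anchors without an href are skipped, and protocol-relative links such as //example.com/x are treated as absolute by also checking the host component.

```php
<?php
/**
 * Collect the text of every <a> whose href has neither a scheme nor a host.
 * The function name and return shape are illustrative, not from the answer.
 */
function relativeLinkTexts(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them;
    // imperfect real-world HTML tends to trigger many of them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '') {
            continue; // anchor without a usable href
        }
        $parts = parse_url($href);
        if ($parts === false) {
            continue; // parse_url() gives false for seriously malformed URLs
        }
        // No scheme rules out "http://..."; no host rules out "//example.com/...".
        if (!isset($parts['scheme']) && !isset($parts['host'])) {
            $texts[] = $a->textContent;
        }
    }
    return $texts;
}

$html = '<a href="http://www.domain.com/products/foo">Foo product</a>'
      . '<a href="/articles/bar">Bar article</a>';
print_r(relativeLinkTexts($html));
```

Returning an array rather than echoing inside the loop leaves the caller free to decide what to do with the matching links.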

Tom Fenech

You can use

<a href="(.*)<\/a>

with preg_match_all and then take the last result from the $matches array. Note that this relies on the wanted link being the last match:

$web =   '<a href="http://www.domain.com/products/foo">Foo product</a>
          <a href="/articles/bar">Bar article</a>';
preg_match_all("/<a href=\"(.*)<\/a>/", $web, $matches);

// prints the full last match: <a href="/articles/bar">Bar article</a>
print_r($matches[0][count($matches[0]) - 1]);

BUT, as others have already pointed out: do not use regex to search through the DOM. Use a DOM parser instead!

trainoasis