
I need to get:

bbish name3 more

bbish name4 more

$p = '%<a\s+href="my-anchor-name3"\s*>(?:.*)</a>%im';
$s = 'some rubbish
<a href="my-anchor-name1">name</a>more rubbish
more rubbish<a href="my-anchor-name2">name2</a>more rubbish
more rubbish<a href="my-anchor-name3">name3</a>more rubbish
more rubbish<a href="my-anchor-name3">name4</a>more rubbish
more rubbish<a href="my-anchor-name5">name5</a>more rubbish';
$out = preg_match_all($p, $s, $matches, PREG_SET_ORDER);

what am I doing wrong?

Mediator

4 Answers


what am I doing wrong?

The main flaw is that you're not instructing PHP to do what you've said you want it to do.


Problems

  • You did not create an array into which to deposit the matches;
  • You're not capturing any backreferences;
  • Your capture inside the a tag is greedy;
  • I suspect that you don't really want to restrict your href value like that;
  • Your HTML input is very restricted, because you're using regular expressions to parse HTML.... grrrrr!! *
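
To illustrate the greediness point, here is a minimal sketch. The one-line input and the href value "x" are made up for the demonstration and do not come from the question.

```php
<?php
// Hypothetical input: two anchors on the same line, chosen to expose greediness.
$line = '<a href="x">first</a> and <a href="x">second</a>';

// Greedy: (.*) runs to the LAST </a> on the line.
preg_match('%<a\s+href="x"\s*>(.*)</a>%i', $line, $greedy);

// Lazy: (.*?) stops at the FIRST </a>.
preg_match('%<a\s+href="x"\s*>(.*?)</a>%i', $line, $lazy);

echo $greedy[1], "\n"; // first</a> and <a href="x">second
echo $lazy[1], "\n";   // first
```

With the question's sample data each line holds only one anchor, so the greedy version happens to work there; it would go wrong as soon as two anchors share a line.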

Fix

Try this:

<?php
$matches = Array();
$p = '%(.{0,5})<a\s+href="my-anchor-name3"\s*>(.*?)</a>(.{0,5})%im';
$s = 'some rubbish
<a href="my-anchor-name1">name</a>more rubbish
more rubbish<a href="my-anchor-name2">name2</a>more rubbish
more rubbish<a href="my-anchor-name3">name3</a>more rubbish
more rubbish<a href="my-anchor-name3">name4</a>more rubbish
more rubbish<a href="my-anchor-name5">name5</a>more rubbish';
$out = preg_match_all($p, $s, $matches, PREG_SET_ORDER);
print_r($matches);
?>

Output:

Array
(
    [0] => Array
        (
            [0] => bbish<a href="my-anchor-name3">name3</a>more 
            [1] => bbish
            [2] => name3
            [3] => more 
        )

    [1] => Array
        (
            [0] => bbish<a href="my-anchor-name3">name4</a>more 
            [1] => bbish
            [2] => name4
            [3] => more 
        )

)



Further work

You may wish to further restrict what characters may be eaten up in those backreferences.
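
One possible restriction, sketched under the assumption that the surrounding context should never contain tag characters or line breaks, is to swap `.` for a negated character class. The input string here is hypothetical.

```php
<?php
// Hypothetical input in which other tags sit right next to the target anchor.
$s = '<b>bold</b> hi <a href="my-anchor-name3">name3</a> ok<br>';

// Loose: any character (except newline) may land in the context groups.
$loose  = '%(.{0,5})<a\s+href="my-anchor-name3"\s*>(.*?)</a>(.{0,5})%i';
// Strict: context may not contain <, >, or line breaks.
$strict = '%([^<>\r\n]{0,5})<a\s+href="my-anchor-name3"\s*>(.*?)</a>([^<>\r\n]{0,5})%i';

preg_match($loose, $s, $l);
preg_match($strict, $s, $t);

var_dump($l[1], $l[3]); // "> hi " and " ok<b"
var_dump($t[1], $t[3]); // " hi " and " ok"
```

The strict version returns fewer than five characters when a tag is nearby, which may or may not be what you want.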

And if you don't want to limit your href values the way you are (and you're doing it in quite a confusing way at present):

$p = '%(.{0,5})<a\s+href="my-anchor-name\d+"\s*>(.*?)</a>(.{0,5})%im';



* The real answer here is that you should not be using regular expressions to parse HTML, which is a well-known fact. Marc has the solution that you should be using.

Lightness Races in Orbit

Do not use regexes. Period. It's trivial to extract the text nodes before and after a particular node's position using the DOM functions.

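$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the markup to search ($s in the question)

$xp = new DOMXPath($dom);

// Select every anchor whose href starts with "my-anchor-name"
$res = $xp->query('//a[starts-with(@href, "my-anchor-name")]');
$out = array();
foreach ($res as $a) {
    // Last 5 characters of the text before the anchor
    $previous = substr($a->previousSibling->nodeValue, -5);
    // First 5 characters of the text after the anchor
    $next = substr($a->nextSibling->nodeValue, 0, 5);
    $here = $a->nodeValue;

    $out[] = $previous . $here . $next;
}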
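
The loop above assumes every matching anchor has a text node on both sides. A more defensive variant might look like the sketch below; `contextAround()` is a hypothetical helper name, it matches one exact href rather than a prefix, and it assumes `$width` is at least 1.

```php
<?php
// Hypothetical helper (not from the answer above): returns [before, text, after]
// for every <a> whose href equals $href. Assumes $width >= 1.
function contextAround(string $html, string $href, int $width = 5): array
{
    $dom = new DOMDocument();
    // @ silences the warnings libxml raises for fragments and sloppy markup.
    @$dom->loadHTML($html);

    $out = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->getAttribute('href') !== $href) {
            continue;
        }
        // Use a sibling only when it is a text node; otherwise fall back to ''.
        $prev = $a->previousSibling instanceof DOMText ? $a->previousSibling->nodeValue : '';
        $next = $a->nextSibling instanceof DOMText ? $a->nextSibling->nodeValue : '';

        $out[] = [
            (string) substr($prev, -$width),   // last $width bytes before the anchor
            $a->nodeValue,                     // the anchor's own text
            (string) substr($next, 0, $width), // first $width bytes after the anchor
        ];
    }
    return $out;
}

// Sample input from the question.
$s = 'some rubbish
<a href="my-anchor-name1">name</a>more rubbish
more rubbish<a href="my-anchor-name2">name2</a>more rubbish
more rubbish<a href="my-anchor-name3">name3</a>more rubbish
more rubbish<a href="my-anchor-name3">name4</a>more rubbish
more rubbish<a href="my-anchor-name5">name5</a>more rubbish';

print_r(contextAround($s, 'my-anchor-name3'));
```

Note that `substr()` counts bytes, so multibyte text would need `mb_substr()` instead.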
Marc B

You're not really giving enough data to make this work exactly, but based on the sample data above, this should work:

$p = '/(.{5})<a\shref="my\-anchor\-(name[0-9]+)">.*<\/a>(.{5})/';
if (preg_match_all($p, $s, $matches, PREG_SET_ORDER)) {
  echo "Matches found.";
} else {
  echo "Matches not found.";
}

Then simply handle all the search hits in the $matches array as you please.
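
As a rough sketch of that handling, assuming the hits were collected with `preg_match_all()` and `PREG_SET_ORDER`: each entry of `$matches` is then one hit, laid out as whole match, text before, name from the href, text after. The shortened input and the `%` delimiter are choices made for this example only.

```php
<?php
// Shortened sample input, based on the question's data.
$s = 'some rubbish
<a href="my-anchor-name1">name</a>more rubbish
more rubbish<a href="my-anchor-name3">name3</a>more rubbish';

// % is used as the delimiter so the / in </a> needs no escaping.
$p = '%(.{5})<a\shref="my-anchor-(name[0-9]+)">.*?</a>(.{5})%';

preg_match_all($p, $s, $matches, PREG_SET_ORDER);

// One iteration per hit: [whole match, before, name from href, after].
foreach ($matches as [$whole, $before, $name, $after]) {
    echo "$before | $name | $after\n";
}
```

The name1 anchor is not reported here because it starts its line, so fewer than five characters precede it.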

Dan

You could prepend and append something like this to the regex: (.{5}).

Thus:

$p = '%(.{5})<a\s+href="my-anchor-name3"\s*>(?:.*)</a>(.{5})%im';
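
One caveat worth checking: `(.{5})` demands exactly five characters on each side, so an anchor with less context on its line is skipped entirely. The sketch below compares it with `(.{0,5})` on a made-up input.

```php
<?php
// Hypothetical input: only two characters on each side of the anchor.
$s = 'ab<a href="my-anchor-name3">x</a>cd';

$exact = '%(.{5})<a\s+href="my-anchor-name3"\s*>(?:.*)</a>(.{5})%im';
$upTo  = '%(.{0,5})<a\s+href="my-anchor-name3"\s*>(?:.*)</a>(.{0,5})%im';

$nExact = preg_match_all($exact, $s, $m1, PREG_SET_ORDER);
$nUpTo  = preg_match_all($upTo, $s, $m2, PREG_SET_ORDER);

echo $nExact, "\n"; // 0: not enough context, so no match at all
echo $nUpTo, "\n";  // 1
print_r($m2[0]);    // whole match, then "ab" and "cd"
```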
Legolas