How to get 5 characters before and after?

Question

I need to get:

bbish name3 more

bbish name4 more

$p = '%<a\s+href="my-anchor-name3"\s*>(?:.*)</a>%im';
$s = 'some rubbish
<a href="my-anchor-name1">name</a>more rubbish
more rubbish<a href="my-anchor-name2">name2</a>more rubbish
more rubbish<a href="my-anchor-name3">name3</a>more rubbish
more rubbish<a href="my-anchor-name3">name4</a>more rubbish
more rubbish<a href="my-anchor-name5">name5</a>more rubbish';
$out = preg_match_all($p, $s, $matches, PREG_SET_ORDER);

what am I doing wrong?

Sorry... what? Where is the output you are getting? Where are you trying to "get" 5 characters? Before and after _what_? — Lightness Races in Orbit, Aug 11 '11 at 16:57

score 3 · Accepted Answer · edited May 23 '17 at 12:11

what am I doing wrong?

The main flaw is that you're not instructing PHP to do what you have said you want it to do.

Problems

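You did not create an array into which to deposit the matches (strictly optional, since preg_match_all populates $matches by reference, but initialising it is tidy);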
You're not capturing any backreferences;
Your capture inside the a tag is greedy;
I suspect that you don't really want to restrict your href value like that;
Your HTML input is very restricted, because you're using regular expressions to parse HTML.... grrrrr!! *
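
To illustrate the greedy-capture point, here is a minimal sketch. The `$line` input is made up for the example (it is not from the question) and puts two links on one line, which is where greedy and lazy quantifiers diverge. In the question's sample data each line holds a single link, so the problem stays hidden there.

```php
<?php
// Hypothetical input: two links on the same line.
$line = '<a href="x">one</a> and <a href="y">two</a>';

// Greedy: .* runs on to the LAST </a> on the line.
preg_match('%<a[^>]*>(.*)</a>%', $line, $greedy);

// Lazy: .*? stops at the FIRST </a>.
preg_match('%<a[^>]*>(.*?)</a>%', $line, $lazy);

echo $greedy[1], "\n"; // one</a> and <a href="y">two
echo $lazy[1], "\n";   // one
```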

Fix

Try this:

<?php
$matches = Array();
$p = '%(.{0,5})<a\s+href="my-anchor-name3"\s*>(.*?)</a>(.{0,5})%im';
$s = 'some rubbish
<a href="my-anchor-name1">name</a>more rubbish
more rubbish<a href="my-anchor-name2">name2</a>more rubbish
more rubbish<a href="my-anchor-name3">name3</a>more rubbish
more rubbish<a href="my-anchor-name3">name4</a>more rubbish
more rubbish<a href="my-anchor-name5">name5</a>more rubbish';
$out = preg_match_all($p, $s, $matches, PREG_SET_ORDER);
print_r($matches);
?>

Output:

Array
(
    [0] => Array
        (
            [0] => bbish<a href="my-anchor-name3">name3</a>more 
            [1] => bbish
            [2] => name3
            [3] => more 
        )

    [1] => Array
        (
            [0] => bbish<a href="my-anchor-name3">name4</a>more 
            [1] => bbish
            [2] => name4
            [3] => more 
        )

)

Further work

You may wish to further restrict what characters may be eaten up in those backreferences.
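
One possible way to do that, as a sketch only: swap `.` for a negated character class so the context groups cannot swallow pieces of neighbouring tags. The class `[^<>\r\n]` is my assumption about what should count as context, and the second line of `$s` is invented to show the difference; adjust both to your real input.

```php
<?php
// First line is from the question; the second is a made-up line where
// the link is surrounded by other tags.
$s = 'more rubbish<a href="my-anchor-name3">name3</a>more rubbish
<b>x</b><a href="my-anchor-name3">name4</a><i>y</i>';

// Context groups may hold up to 5 characters, but never <, > or a line break.
$p = '%([^<>\r\n]{0,5})<a\s+href="my-anchor-name3"\s*>(.*?)</a>([^<>\r\n]{0,5})%i';

preg_match_all($p, $s, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    printf("[%s][%s][%s]\n", $m[1], $m[2], $m[3]);
}
// [bbish][name3][more ]
// [][name4][]
```

With the plain `.{0,5}` version, the second match would instead pick up tag fragments such as `x</b>` as its "before" context.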

And if you don't want to limit your href values the way you are (and you're doing it in quite a confusing way at present):

$p = '%(.{0,5})<a\s+href="my-anchor-name\d+"\s*>(.*?)</a>(.{0,5})%im';

* The real answer here is that you should not be using regular expressions to parse HTML, which is a well-known fact. Marc has the solution that you should be using.

score 2 · Answer 2 · answered Aug 11 '11 at 17:01

Do not use regexes. Period. It's trivial to extract the text nodes before/after a particular node's position using DOM functions.

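$dom = new DOMDocument();
$dom->loadHTML($html);

$xp = new DOMXPath($dom);

// Select every <a> whose href starts with "my-anchor-name"
$res = $xp->query('//a[starts-with(@href, "my-anchor-name")]');
$out = array();
foreach ($res as $a) {
    // Last 5 characters of the node just before the link
    $previous = substr($a->previousSibling->nodeValue, -5);
    // First 5 characters of the node just after the link
    $next = substr($a->nextSibling->nodeValue, 0, 5);
    $here = $a->nodeValue;

    $out[] = $previous . $here . $next;
}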
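
For reference, a self-contained sketch of the same DOM approach run against the sample string from the question. It assumes PHP's DOM extension is available. The exact-match XPath on `my-anchor-name3`, the `instanceof DOMText` guards, and the spaces used to join the pieces are my additions, not part of the snippet above.

```php
<?php
// Sample input copied from the question.
$html = 'some rubbish
<a href="my-anchor-name1">name</a>more rubbish
more rubbish<a href="my-anchor-name2">name2</a>more rubbish
more rubbish<a href="my-anchor-name3">name3</a>more rubbish
more rubbish<a href="my-anchor-name3">name4</a>more rubbish
more rubbish<a href="my-anchor-name5">name5</a>more rubbish';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$out = array();
foreach ($xp->query('//a[@href="my-anchor-name3"]') as $a) {
    $prev = $a->previousSibling;
    $next = $a->nextSibling;
    // Only read context from siblings that exist and are text nodes.
    $before = ($prev instanceof DOMText) ? substr($prev->nodeValue, -5) : '';
    $after  = ($next instanceof DOMText) ? substr($next->nodeValue, 0, 5) : '';
    $out[] = $before . ' ' . $a->nodeValue . ' ' . $after;
}

print_r($out);
```

The guards matter when a link is the first or last child of its parent, or sits directly next to another element; in those cases the sibling is `null` or not a text node, and the context is left empty rather than raising an error.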

score 0 · Answer 3 · Dan · edited Aug 11 '11 at 17:13

You're not really giving enough data to make this work exactly, but based on the sample data above, this should work:

$p = '/(.{5})<a\shref="my\-anchor\-(name[0-9]+)">.*</a>(.{5})/';
if (preg_match_all($p, $s, $matches, PREG_SET_ORDER)) {
  echo "Matches found.";
} else {
  echo "Matches not found.";
}

Then simply handle all the search hits in the $matches array as you please.

answered Aug 11 '11 at 17:06

Your `` matching is broken. – Lightness Races in Orbit Aug 11 '11 at 17:11
Ahh yes, thanks. Edited it. I didn't test this, just threw it together real quick as an example. – Dan Aug 11 '11 at 17:14
You're now missing one of the outputs, and I think not supporting multi-line input properly. I could go on. :) – Lightness Races in Orbit Aug 11 '11 at 17:15

score -1 · Answer 4 · answered Aug 11 '11 at 16:58

You could prepend and append something like this to the regex: (.{5}).

Thus:

$p = '%(.{5})<a\s+href="my-anchor-name3"\s*>(?:.*)</a>(.{5})%im';
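
One caveat worth seeing in action: `.{5}` demands exactly five characters on each side, so a link with fewer than five characters of same-line context will not match at all, whereas `.{0,5}` (as in the accepted answer) still does. A small sketch, using a made-up input where the link sits at the very start of the string:

```php
<?php
// Hypothetical input: nothing precedes the link.
$s = '<a href="my-anchor-name3">name3</a>more rubbish';

// Exactly 5 characters required on both sides.
$strict = preg_match('%(.{5})<a\s+href="my-anchor-name3"\s*>(?:.*)</a>(.{5})%im', $s, $m1);

// Up to 5 characters allowed on both sides.
$loose = preg_match('%(.{0,5})<a\s+href="my-anchor-name3"\s*>(?:.*)</a>(.{0,5})%im', $s, $m2);

var_dump($strict); // int(0)
var_dump($loose);  // int(1)
```

Note that because the link text sits in a non-capturing group here, `$m2[2]` is the trailing context, not the link text.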

Legolas
