
I have an HTML file containing some data, including some URLs.

In these URLs only, I want to replace the _ character with a space (via a PHP file).

So a URL like this:

</p><p><a rel="nofollow" class="external text" href="http://10.20.0.30:1234/index.php/this_is_an_example.html">How_to_sample.</a>

will become

</p><p><a rel="nofollow" class="external text" href="http://10.20.0.30:1234/index.php/this is an example.html">How_to_sample.</a>

This must not affect any _ that is not part of a URL.

I think this might be possible with preg_replace, but I don't know how to go about it.

The following code is incorrect because it replaces every _, not just the ones in URLs.

$content2 = preg_replace('/[_]/', ' ', $content);

Thanks.

EDIT:

Thanks for the preg_replace_callback suggestion; this is what I was looking for.

    // Search pattern: dots are escaped so they only match a literal "."
    $pattern = '/href="http:\/\/10\.20\.0\.30:1234\/index\.php\/(.*?)\.html">/s';

    // Run the callback on every matching href
    $content2 = preg_replace_callback($pattern, 'callback', $content);

    // Callback: $m[1] is the part of the URL between "index.php/" and ".html"
    function callback ($m) {
        $url = str_replace("_", " ", $m[1]);
        return 'href="http://10.20.0.30:1234/index.php/'.$url.'.html">';
    }
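
The same idea can be sketched with the pattern built from a variable. This is only an illustration: `$base` and the sample `$content` are assumptions, `preg_quote()` takes care of escaping the dots, and an anonymous function replaces the named callback.

```php
<?php
// Assumption: $base is the only URL prefix whose links should be rewritten.
$base = 'http://10.20.0.30:1234/index.php/';

// preg_quote() escapes regex metacharacters in $base (the dots, for example).
$pattern = '~href="' . preg_quote($base, '~') . '(.*?)\.html">~s';

// Sample input, shortened from the question.
$content = '<a href="http://10.20.0.30:1234/index.php/this_is_an_example.html">How_to_sample.</a>';

$content2 = preg_replace_callback(
    $pattern,
    function (array $m) use ($base): string {
        // $m[1] is the captured path between the prefix and ".html".
        return 'href="' . $base . str_replace('_', ' ', $m[1]) . '.html">';
    },
    $content
);

echo $content2, "\n";
```

Only the captured path is changed, so the link text keeps its underscores.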
Kodatrololo

1 Answer


Older and wiser: Don't use regex - it is not necessary and it may be prone to instability because regex is not DOM-aware. Use an HTML parser to isolate the <a> tags and then the href attribute, then make a simple str_replace() call.

Code: (Demo)

$html = <<<HTML
<p><a rel="nofollow" class="external text" href="http://10.20.0.30:1234/index.php/this_is_an_example.html">How_to_sample.</a></p>
HTML;

$dom = new DOMDocument; 
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach($dom->getElementsByTagName('a') as $a) {
    $a->setAttribute('href', str_replace('_', '%20', $a->getAttribute('href')));
}
echo $dom->saveHTML();

Output:

<p><a rel="nofollow" class="external text" href="http://10.20.0.30:1234/index.php/this%20is%20an%20example.html">How_to_sample.</a></p>

A URL should not contain any spaces; spaces should be encoded as %20. See: Is a URL allowed to contain a space?
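
The question asks to change only certain URLs, while the loop above rewrites every href. A minimal sketch of one way to narrow it down, assuming the links of interest all start with a known prefix (the helper name `encodeUnderscoresInLinks`, the `$prefix` parameter, and the second sample link are hypothetical, added for illustration):

```php
<?php
/**
 * Replace underscores with %20 in href attributes that start with $prefix.
 * Hypothetical helper for illustration; not part of the original answer.
 */
function encodeUnderscoresInLinks(string $html, string $prefix): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Only rewrite links that begin with the expected prefix.
        if (strpos($href, $prefix) === 0) {
            $a->setAttribute('href', str_replace('_', '%20', $href));
        }
    }

    return $dom->saveHTML();
}

// Sample input: one link to rewrite, one (made-up) link that must stay as-is.
$html = '<p><a href="http://10.20.0.30:1234/index.php/this_is_an_example.html">How_to_sample.</a> '
      . '<a href="http://example.com/other_page.html">Other_link</a></p>';

$result = encodeUnderscoresInLinks($html, 'http://10.20.0.30:1234/index.php/');
echo $result;
```

Note that `str_replace()` runs over the whole href, so an underscore inside the prefix itself would also be replaced; the prefix used here contains none.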


Original answer:

If you are open to some regex trickery, you can accomplish your task with preg_replace() alone.

Code: (Demo)

$input = '</p><p><a rel="nofollow" class="external text" href="http://10.20.0.30:1234/index.php/this_is_an_example.html">How_to_sample.</a>';

$pattern = '~(?:\G|\Qhttp://10.20.0.30:1234/index.php\E[^_]+)\K_([^_.]*)~';

echo preg_replace($pattern, " $1", $input);

Output:

</p><p><a rel="nofollow" class="external text" href="http://10.20.0.30:1234/index.php/this is an example.html">How_to_sample.</a>

\G is the "continue" metacharacter: it matches at the position where the previous match ended. It allows you to make multiple consecutive matches after the expected portion of the URL.

\Q..\E says "treat all characters between the two points literally", so no escaping is necessary.

\K means "restart the fullstring match from this point".
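
To see the three tokens in isolation, here is a small sketch on made-up strings (the inputs and variable names are illustrative only, not taken from the question):

```php
<?php
// \Q...\E: the dots are literal, so "10x20x0x30" does not match.
$quoted   = preg_match('~\Q10.20.0.30\E~', '10x20x0x30'); // 0
$unquoted = preg_match('~10.20.0.30~', '10x20x0x30');     // 1 (each "." matches "x")

// \K: "foo" must be present, but only "bar" belongs to the reported match,
// so only "bar" is replaced.
$keptOut = preg_replace('~foo\Kbar~', 'BAZ', 'foobar barfoo'); // 'fooBAZ barfoo'

// \G: the first replaced "a" must follow "#"; each further "a" is replaced
// only if it starts exactly where the previous match ended.
$chained = preg_replace('~(?:\G|#)\Ka~', 'b', 'xa#aaa ca'); // 'xa#bbb ca'

var_dump($quoted, $unquoted, $keptOut, $chained);
```

In the last call the chain breaks at the space, so the final "a" is left alone even though it is the same character.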

Pattern Demo

Since you are building a URL, I reckon you should be replacing with %20.

As a best practice, the pattern should also stop \G from matching at the start of the string (\G matches there as well as at the end of the previous match):

$pattern = '~(?:\G(?!^)|\Qhttp://10.20.0.30:1234/index.php\E[^_]+)\K_([^_.]*)~';
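
A short sketch of what the `(?!^)` guard changes. The input `$text` is a made-up string that starts with an underscore and contains no URL; the variable names are illustrative only.

```php
<?php
// Shared part of both patterns: the literal URL prefix, then everything up to the first "_".
$prefix  = '\Qhttp://10.20.0.30:1234/index.php\E[^_]+';
$noGuard = '~(?:\G|' . $prefix . ')\K_([^_.]*)~';
$guarded = '~(?:\G(?!^)|' . $prefix . ')\K_([^_.]*)~';

// Hypothetical input: no URL at all, but it begins with an underscore.
$text = '_leading_text.';

// Without the guard, \G matches at offset 0 and the chain starts there.
$a = preg_replace($noGuard, ' $1', $text); // ' leading text.'

// With the guard, \G cannot match at the start, so nothing is replaced.
$b = preg_replace($guarded, ' $1', $text); // '_leading_text.'

// The guarded pattern still handles the URL from the question.
$link = '<a href="http://10.20.0.30:1234/index.php/this_is_an_example.html">How_to_sample.</a>';
$c = preg_replace($guarded, ' $1', $link);

var_dump($a, $b, $c);
```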
mickmackusa