First, obligatory "do not parse HTML with RegEx" followed by "if you control the HTML it might be safe".
Your definition of "empty" is a little vague. Your sample uses a literally empty href, but the RegEx you are showing has a # in it. Regardless, I came up with two ways.
I personally don't like preg_replace too much unless I'm using it to remove something, but that's just me, and it isn't wrong to use. This first example just looks for an opening <a tag, followed by a non-greedy match of anything that isn't the closing >, followed by href=", an optional hash or whitespace, a closing ", and anything else up to the >. This can very easily break, for instance if you have spaces around the = after href or if you use different quotes, but that's a bridge to cross later, possibly.
$testData = [
    '<a href=""><img/></a>',
    '<a href="#"><img/></a>',
    '<a class="button" href="#" rel="nofollow"><img/></a>',
    '<a href=""><img/></a>',
];
foreach ($testData as $test) {
    echo preg_replace('/(<a[^>]*?href=")(?:#|\s+)?("[^>]*>)/', '$1https://example.com$2', $test), PHP_EOL;
}
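If you do end up needing to cross that bridge, here is a rough sketch of a more tolerant pattern. It is only one possible approach, not a robust HTML parser: it allows whitespace around the = and accepts either quote style by back-referencing whichever quote opened the attribute. The function name fillEmptyHrefs and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer): fills empty,
// whitespace-only, or "#" href values on <a> tags with the given URL.
function fillEmptyHrefs(string $html, string $url): string
{
    // Group 1: everything from "<a" up to the opening quote, including
    //          optional whitespace around "=".
    // Group 2: the opening quote (single or double); \2 requires the same
    //          character to close the attribute.
    $pattern = '/(<a\b[^>]*?\shref\s*=\s*)(["\'])\s*#?\s*\2/i';

    // A callback is used instead of a replacement string so that characters
    // such as "$" in the URL are never treated as group references.
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($url): string {
            return $m[1] . $m[2] . $url . $m[2];
        },
        $html
    );

    // preg_replace_callback() returns null on a regex engine error;
    // fall back to the untouched input in that case.
    return $result ?? $html;
}

echo fillEmptyHrefs("<a href = '#'><img/></a>", 'https://example.com'), PHP_EOL;
// prints: <a href = 'https://example.com'><img/></a>
```

This still won't cope with unquoted attribute values or a > inside another attribute, which is part of why the DOM-based version is easier to trust.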
The second version uses DOMDocument, which I know you said you didn't want to use, but honestly it allows you to reason about things much better, to the point that I don't think it needs any more comment.
$testData = [
    '<a href=""><img/></a>',
    '<a href="#"><img/></a>',
    '<a class="button" href="#" rel="nofollow"><img/></a>',
    '<a href=""><img/></a>',
];
foreach ($testData as $test) {
    $d = new DOMDocument();
    $d->loadHTML($test, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    foreach ($d->getElementsByTagName('a') as $tag) {
        if (in_array(trim($tag->getAttribute('href')), ['#', ''], true)) {
            $tag->setAttribute('href', 'https://example.com');
        }
    }
    echo $d->saveHTML(), PHP_EOL;
}
I will note that DOMDocument might change your HTML just slightly; however, for most people these days, HTML vs XHTML doesn't really matter much anymore.
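If you want to reuse the DOMDocument approach, it could be wrapped up roughly like this. Treat it as a sketch under a few assumptions: the function name replaceEmptyHrefs and its parameters are invented for illustration, and the hasAttribute check is my own addition so that anchors with no href at all are left alone (the loop above would give them one). It also keeps libxml's parse warnings out of your output, which can otherwise appear for markup the parser doesn't recognise.

```php
<?php
// Hypothetical wrapper (not part of the original answer) around the
// DOMDocument loop shown above.
function replaceEmptyHrefs(string $html, string $url): string
{
    // Collect libxml parse errors internally instead of emitting PHP
    // warnings, then restore the previous setting afterwards.
    $previous = libxml_use_internal_errors(true);

    $d = new DOMDocument();
    $d->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($d->getElementsByTagName('a') as $tag) {
        // Assumption: only touch anchors that actually carry an href.
        if ($tag->hasAttribute('href')
            && in_array(trim($tag->getAttribute('href')), ['#', ''], true)
        ) {
            $tag->setAttribute('href', $url);
        }
    }

    // saveHTML() appends a trailing newline; trim it for easier comparison.
    return trim($d->saveHTML());
}

echo replaceEmptyHrefs('<a class="button" href="#" rel="nofollow"><img/></a>', 'https://example.com'), PHP_EOL;
```

Comparing the input with what comes back is a quick way to see the kind of change mentioned above. I'd expect things like the self-closing <img/> to be serialized in HTML rather than XHTML style, but check against your own markup.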