PHP + regex to find and replace any tag that is within a certain class in a source code

Question

I have HTML code in a PHP variable, and I need to replace every link that is contained within another tag that has an "obfuscate" class, e.g.:

<div class="obfuscate foobar">
    <strong>
        <a href="https://example.com" class="randomclass" target="_BLANK">test</a>
    </strong>
</div>

I need the <a> tag to be replaced with a <span> that inherits everything from the original tag, with an "akn-obf-link" class added, the link obfuscated with base64_encode() in a "data-o" attribute, and a "data-b" attribute whose value is "1" if the link has target _blank, or "0" otherwise.

In the example above, the <a> tag should be converted to:

<span class="akn-obf-link randomclass" data-o="aHR0cHM6Ly9leGFtcGxlLmNvbQ==" data-b="1">test</span>

I already have code that does this when the <a> tag itself has the "obfuscate" class, in case that helps:

$result = preg_replace_callback('#<a[^>]+((?<=\s)href=(\"|\')([^\"\']*)(\'|\")[^>]+(?<=\s)class=(\"|\')[^\'\"]*(?<!\w|-)obfuscate(?!\w|-)[^\'\"]*(\"|\')|(?<=\s)class=(\"|\')[^\'\"]*(?<!\w|-)obfuscate(?!\w|-)[^\'\"]*(\"|\')[^>]+(?<=\s)href=(\"|\')([^\"\']*)(\'|\"))[^>]*>(.*)<\/a>#miUs', function($matches) {
        preg_match('#<a[^>]+(?<=\s)class=[\"|\\\']([^\\\'\"]+)[\"|\\\']#imUs',$matches[0],$matches_classes);
        $classes = trim(preg_replace('/\s+/',' ',str_replace('obfuscate','',$matches_classes[1])));
        return '<span class="akn-obf-link'.($classes?' '.$classes:'').'" data-o="'.base64_encode($matches[3]?:$matches[10]).'" data-b="'.((strpos(strtolower($matches[0]),'_blank')!==false)?'1':'0').'">'.$matches[12].'</span>';
    }, $code);

I need the same but whenever the tag is inside another tag that has the "obfuscate" class.

Any particular reason why you want to use regexes? HTML parsers are much better suited for parsing HTML. — Destroy666, Apr 03 '23 at 14:14
Initially coded it to be the most universal possible, light, and working without any library — rAthus, Apr 03 '23 at 16:44
That's definitely not "light" compared to HTML parsing, which is also as "universal" as regexes, that have different capabilities across different languages. — Destroy666, Apr 03 '23 at 16:52
I would first search for all divs with the *obfuscate* class and then, in the callback, look for all `<a>` tags and run a second *preg_replace_callback()* to transform each link into the span. We could also simplify your regex a bit. Typically, instead of `(?<=\s)` in front of the attributes, we can just use `\b` for a word boundary. — Patrick Janser, Apr 04 '23 at 07:35

score 0 · Answer 1 · answered Apr 04 '23 at 14:17

Trying to solve this with a regex will be painful and unsafe, for reasons that have been discussed many times on Stack Overflow.

Typically, what happens if the <div class="obfuscate"> contains some child <div> tags?

<div class="obfuscate foobar">
    <div>Something</div>
    <strong>
        <a href="https://example.com" class="randomclass" target="_BLANK">test</a>
    </strong>
</div>

This means that you'll have to handle recursion in your regular expression, because a simple pattern like this one will not work:

~<\s*div\s+
# The mandatory class anywhere in the tag:
(?=[^>]*\bclass="(?<class>[^>]*?)")
# The rest of the attributes:
[^>]*>
# The content of the <div>, in an ungreedy way:
(.*?)
# The closing </div> tag:
<\s*/\s*div\s*>~sx

As we can see here, it stops at the first closing </div>, so it's not capturing the full content of the outer div. You'll need a regex that matches balanced (nested) tags to solve this issue.
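To give an idea of what such a balanced pattern could look like, here is a minimal sketch using PCRE recursion. It is only an illustration under assumptions: the class attribute uses double quotes, every <div> is properly closed, and the group names `inner` and `nested` are mine. It also walks the content one character at a time, so it may be slow or hit PCRE limits on large documents.

```php
<?php
// The example with a child <div> from above.
$html = <<<'HTML'
<div class="obfuscate foobar">
    <div>Something</div>
    <strong>
        <a href="https://example.com" class="randomclass" target="_BLANK">test</a>
    </strong>
</div>
HTML;

$pattern = '~
    <div\b
    # The opening tag must carry the obfuscate class token (double quotes assumed):
    (?=[^>]*\bclass\s*=\s*"[^"]*(?<![\w-])obfuscate(?![\w-])[^"]*")
    [^>]*>
    (?<inner>
        (?:
            (?!</?div\b) .      # any character that does not start a div tag
          |
            (?<nested>          # or a complete, balanced child div
                <div\b[^>]*>
                (?: (?!</?div\b) . | (?&nested) )*+
                </div\s*>
            )
        )*+
    )
    </div\s*>
~isx';

// On success, $m['inner'] holds everything between the outer tags,
// including the child <div> and the link.
$found = preg_match($pattern, $html, $m);
```

The `(?&nested)` call is what lets the pattern skip over child divs as whole units instead of stopping at the first closing tag.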

Ok, let's assume you have the class="..." attribute with nice double quotes like in a good old romantic film. And we assume you don't have child divs. This means you can capture the inner HTML and then look for all <a> tags with a relatively complex pattern like this one:

~# Declaration of all regex sub-routines:
(?(DEFINE)
# This sub-routine will match an attribute value with or without the quotes around it.
# So it will match "https://example.com" or 'https://example.com' (example with href)
# but also match my-class-name if we had something like <div class=my-class-name>
(?<attr_value_with_delim>(?:(?<delimiter>["']).*?(?:\k<delimiter>)|[^"'=<>\s]+))
)

# The regex pattern starts here:
# Match an opening <a> tag.
<\s*a\s+
# All the attributes are optional as <a name="my-anchor"></a> is allowed.
# But you can remove the ? at the end if you want to make them mandatory.
# You may also add other attributes such as hreflang, type, data-*, etc.
(?=[^>]*\bhref\s*=\s*(?<href>\g<attr_value_with_delim>))?
(?=[^>]*\bid\s*=\s*(?<id>\g<attr_value_with_delim>))?
(?=[^>]*\bclass\s*=\s*(?<class>\g<attr_value_with_delim>))?
(?=[^>]*\bname\s*=\s*(?<name>\g<attr_value_with_delim>))?
(?=[^>]*\btarget\s*=\s*(?<target>\g<attr_value_with_delim>))?
(?=[^>]*\btitle\s*=\s*(?<title>\g<attr_value_with_delim>))?
(?=[^>]*\bdownload\s*=\s*(?<download>\g<attr_value_with_delim>))?
(?=[^>]*\brel\s*=\s*(?<rel>\g<attr_value_with_delim>))?
[^>]*>
(.*?)
<\s*/\s*a\s*>~isx

I've made it here: https://regex101.com/r/ZSx69l/2

I wanted to handle attributes with double quotes, single quotes and no quotes. I tried to capture the value without the quotes, but didn't find how to do it right. Never mind, because with the preg_replace_callback() function you can then drop the quotes with trim(..., '"\'') or with a regex. You'll then be able to calculate your base64 and rewrite it to the desired output.
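As a rough sketch of that callback step, assuming a reduced version of the pattern that only captures href, class and target (quotes included), it could look like this. The function name `linkToSpan` and the variable names are hypothetical, and the code does not escape or decode attribute values, so treat it as a starting point only.

```php
<?php
// One attribute value: double-quoted, single-quoted, or unquoted.
$attr = '"[^"]*"|\'[^\']*\'|[^"\'=<>\s]+';

// Reduced pattern: each attribute is optional and captured with its quotes.
$pattern = '~<a\s+'
    . '(?=[^>]*\bhref\s*=\s*(?<href>' . $attr . '))?'
    . '(?=[^>]*\bclass\s*=\s*(?<class>' . $attr . '))?'
    . '(?=[^>]*\btarget\s*=\s*(?<target>' . $attr . '))?'
    . '[^>]*>(?<text>.*?)</a\s*>~is';

function linkToSpan(array $m): string
{
    // Drop the surrounding quotes, if any; unmatched groups may be missing or empty.
    $value = static fn (string $key): string => trim($m[$key] ?? '', '"\'');

    $classes = trim('akn-obf-link ' . $value('class'));
    $blank   = strtolower($value('target')) === '_blank' ? '1' : '0';

    return '<span class="' . $classes . '" data-o="' . base64_encode($value('href'))
        . '" data-b="' . $blank . '">' . $m['text'] . '</span>';
}

// Inner HTML of an obfuscated block, mixing the three quoting styles.
$inner  = '<strong><a href=\'https://example.com\' class=randomclass target="_BLANK">test</a></strong>';
$result = preg_replace_callback($pattern, 'linkToSpan', $inner);
```

Using `?? ''` matters because preg_replace_callback() may leave out trailing groups that did not take part in the match.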

But will this really solve all the malformed HTML code? Probably not.

I would stick with PHP's DOMDocument to have something safe. It's installed everywhere these days, and the execution time will not be significant compared to the risks of bugs.
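Here is a minimal sketch of how the DOMDocument approach could look. The function name `obfuscateNestedLinks` and the wrapper id `akn-wrapper` are hypothetical, and I only copy the class, href and target information because that is what the expected output in the question shows. Adapt it if other attributes have to be kept.

```php
<?php
function obfuscateNestedLinks(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // The XML prolog hints UTF-8 to libxml; the wrapper <div> lets us read the fragment back.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="akn-wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Every <a href> that has an ancestor whose class list contains the token "obfuscate".
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " obfuscate ")]//a[@href]';

    // Work on a copy of the result list while nodes are being replaced.
    foreach (iterator_to_array($xpath->query($query)) as $a) {
        $span = $doc->createElement('span');
        $span->setAttribute('class', trim('akn-obf-link ' . $a->getAttribute('class')));
        $span->setAttribute('data-o', base64_encode($a->getAttribute('href')));
        $isBlank = strtolower($a->getAttribute('target')) === '_blank';
        $span->setAttribute('data-b', $isBlank ? '1' : '0');
        while ($a->firstChild !== null) {
            $span->appendChild($a->firstChild); // move the children, keeping inner markup
        }
        $a->parentNode->replaceChild($span, $a);
    }

    // Serialize only the children of the wrapper, not the <html>/<body> libxml added.
    $wrapper = $xpath->query('//div[@id="akn-wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$code = '<div class="obfuscate foobar"><div>Something</div><strong>'
      . '<a href="https://example.com" class="randomclass" target="_BLANK">test</a>'
      . '</strong></div><p><a href="/x">y</a></p>';
$converted = obfuscateNestedLinks($code);
```

Nested child divs are no problem here, since the XPath query only cares about ancestors, and the link outside of the obfuscated block is left alone.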

You might not even need to parse the full content of your HTML page: you can first grab just the part you need with a bulletproof regex and then hand only that fragment to DOMDocument.
