
I am presented with an HTML document similar to this in view source mode (the below is simplified for brevity):

<html>
    <head>
        <title>System version: {{variable:system_version}}</title>
    </head>
    <body>
        <p>You are using system version {{variable:system_version}}</p>
        {{block:welcome}}
        <form>
            <input value="System version: {{variable:system_version}}">
            <textarea>
                You are using system version {{variable:system_version}}.
            </textarea>
        </form>
    </body>
</html>

I have written some functions that can replace these {{...}} type strings, but they need to be replaced selectively.

In the example above, I want it replaced in <title> and in <p>, but not in <input> and <textarea>, because that is user-provided input inserted via a WYSIWYG editor or form, and it must be saved exactly as received from the user. The {{block:welcome}} must also be replaced with whatever content it contains.

When rendering my output, I will sanitize it; the result should then be something like this:

<html>
    <head>
        <title>System version: 6.0</title>
    </head>
    <body>
        <p>You are using system version 6.0</p>
        <div>
            This was the content of the welcome block.
        </div>
        <form>
            <input value="System version: {{variable:system_version}}">
            <textarea>
                You are using system version {{variable:system_version}}.
            </textarea>
        </form>
    </body>
</html>

Here is what I have tried. In the code below, $var's value is '6.0', $val's value is '{{variable:system_version}}', and $data is the entire string to be searched:

if (!preg_match('/<textarea|<input|<select(.+?)' . $val . '(.+?)<\/textarea|<\/input|<\/select\>/s', $data)) {
    $data = str_replace($val, $var, $data);
}    

Please advise what is wrong with my regex. It currently replaces nothing whatsoever, so the if condition is never satisfied. If I do the str_replace without the if, the replacements are made in all cases.

EDIT 1

After some assistance by @Emma, the replacement still does not work. The below is the code that does the replacement as it stands:

    function replace_variable($matches, $data)
    {
        $ci =& get_instance();
        if (!empty($matches['variable_matches'])) {
            foreach ($matches['variable_matches'][0] as $key => $val) {
                $vals = explode(':', $val);
                $ci->load->module('core');
                $var = $ci->core->get_variable(rtrim($vals[1], '}}'));
                $re1 = '/<(?:textarea|select)[\s\S]*?>[\s\S]*?(' . $val . ')[\s\S]*?<\/(?:textarea|select)>/';
                $re2 = '/<(?:input)[\s\S]*?(' . $val . ')[\s\S]*?>/';
                if (!preg_match($re1, $data) && !preg_match($re2, $data)) {
                    $data = str_replace($val, $var, $data);
                }
            }
        }
        return $data;
    }

Here are the output values of the matches found via preg_match, and then I am trying to replace via str_replace where NOT inside a form tag (select/textarea/input).

Array
(
    [0] => Array
        (
            [0] => {{variable:system_version}}
            [1] => {{variable:system_version}}
            [2] => {{variable:system_version}}
            [3] => {{variable:system_version}}
        )

    [1] => Array
        (
            [0] => system_version
            [1] => system_version
            [2] => system_version
            [3] => system_version
        )

)

So - there are four matches on the page where I try to replace, two of them inside form tags, the other two not. The check is done on the entire output that is buffered, and contains all four elements, but somehow, the preg_match triggers for all of them, despite the regex. Any ideas what I am doing wrong?

mickmackusa
Kobus Myburgh
  • This looks like a job for `DOMXPath`, using something like `//title[contains(.,"{{variable:system_version}}")] | //p[contains(.,"{{variable:system_version}}")]`. Don't parse HTML with regular expressions if there is a good parser to hand. – Dean Taylor Jul 21 '19 at 23:14
  • Thanks, Dean. The thing is that my content is user-provided and varies. I have {{blocks}}, {{variables}}, and whatever else, and the structure of the document given was simplified significantly to keep the question short. Technically, these can be totally different pages with completely different structure. I have no objection to using any solution that will work, but I am not sure how DOMXPath could work with variable input such as I may encounter. – Kobus Myburgh Jul 21 '19 at 23:25

2 Answers


I was about to post an answer on your next question, but Casimir closed it before I got the chance. I am coming back here to post a proper HTML parse-then-replace technique for the benefit of researchers and you.

Code:

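// Lookup table: placeholder type => [placeholder name => replacement].
define('LOOKUP', [
    'block' => [
        'welcome-intro'         => 'custom intro'
    ],
    'variable' => [
        'contact-email-address' => 'mmu@mmu.com',
        'system_version'        => 'sys ver',
        'system_name'           => 'sys name',
        'system_login'          => 'sys login',
        'activate_url'          => 'some url'
    ],
]);

// $html is assumed to already hold the page markup as a string.
$dom = new DOMDocument();
libxml_use_internal_errors(true);  // suppress warnings about tags such as <section>
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Visit the text nodes of every element that is not a form control and contains a placeholder.
foreach ($xpath->query("//*[not(self::textarea or self::select or self::input) and contains(., '{{{')]/text()") as $node) {
    // Swap each {{{type:name}}} for its lookup value, or for a marker if it is unknown.
    $node->nodeValue = preg_replace_callback('~{{{([^:]+):([^}]+)}}}~', function($m) {
            return LOOKUP[$m[1]][$m[2]] ?? '**unknown variable**';
        },
        $node->nodeValue);
}
echo $dom->saveHTML();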

Output:

<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"><title>Test</title></head><body>
    <section id="about"><div class="container about-container">
            <div class="row">
                <div class="col-md-12">
                    custom intro
                </div>
            </div>
        </div>
    </section><section id="services"><div class="container">
            <div class="row">
                <div class="col-md-12">
                                        <p>You are using system version: sys ver</p>
                    <p>Your address: mmu@mmu.com</p>
                    <form action="http://k.loc/content/view/welcome" class="default-form" enctype="multipart/form-data" method="post" accept-charset="utf-8">
                                                                                    <input type="hidden" name="csrfkcmstoken" value="94ee71ada809b9a79d1b723c81020c78"><div class="row">
                            <div class="col-sm-12 form-error"></div>
                        </div>
                    <div class="row"><div class="col-sm-12"><fieldset id="personalinfo"><legend>Personal information</legend><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testinput">Name<span class="form-validation-required"> * </span></label>

                    </div>
                <div class="hint-text">Enter at least 2 characters and a maximum of 12 characters.</div><input id="testinput" name="testinput" placeholder="Enter your name here." class="input-group width-50" type="text" value="{{{variable:system_name}}}  {{{variable:system_login}}}"><div class="row"><div class="col-sm-12"><div class="form-error"></div></div></div></div></div><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testpassword">Password</label>

                    </div>
                <div class="hint-text">Your password must be at least 12 characters long, contain 1 special character, 1 nunber, 1 lower case character and 1 upper case character.</div><input id="testpassword" name="testpassword" placeholder="Enter your password here." class="input-group width-50" type="password"><div class="row"><div class="col-sm-12"><div class="form-error"></div></div></div></div></div></fieldset></div></div><div class="row"><div class="col-sm-12"><fieldset id="bioinfo"><legend>Biographical information</legend><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testtextarea">Biography</label>
                <span class="hint-text">A minimum of 40 characters and a maximum of 255 is allowed. This hint is displayed inline.</span>
                    </div>
                <textarea id="testtextarea" name="testtextarea" placeholder="Please enter your biography here." class="input-group-wide width-100" rows="5" cols="80">{{{variable:system_name}}}

{{{variable:system_login}}}</textarea><div class="row"><div class="col-sm-12"><div class="form-error"></div></div></div></div></div><div class="row"><div class="col-sm-12">
                    <div class="control-label">
                        <label for="testsummernote">Interests</label>
                <span class="hint-text">A minimum of 40 characters is required. This hint is displayed inline.</span>
                    </div>
                <textarea id="testsummernote" name="testsummernote" class="wysiwyg-editor" placeholder="Please enter your interests here."><p>sys name<br></p><p>sys login</p><p>some url<br></p></textarea></div></div></fieldset></div></div><div class="row"><div class="col-sm-12"><button name="testsubmit" id="testsubmit" type="submit" class="btn primary">Submit<i class="zmdi zmdi-arrow-forward"></i></button></div></div>
        </form>                </div>
            </div>
        </div>
    </section></body></html>

There aren't too many tricks involved.

  1. Parse the HTML with DOMDocument and write a filtering XPath query that requires elements to not be textarea, select, or input tags and to contain {{{ in their text; the query then selects the text nodes of those elements. There are several other ways to filter the DOM; this is just one that feels efficient and direct to me.

  2. I use preg_replace_callback() to perform replacements based on a lookup array.

  3. To avoid use() in the callback syntax, I make the lookup available inside the callback's scope by declaring it as a constant (I can't imagine you need it to be a variable anyhow).

  4. I found during testing that DOMDocument didn't like the <section> tags, so I silenced the complaints with libxml_use_internal_errors(true);.
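
For the simplified markup in the question, which uses two braces rather than three, the same idea could look roughly like the sketch below. The function name `replace_placeholders` and the `$lookup` array are made up for illustration. It also differs from the code above in three ways that are my own choices: the lookup is passed in with `use` instead of a constant, the XPath filters text nodes by ancestor so that text nested deeper inside a select (for example in an option) is skipped too, and unknown placeholders are left untouched.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function replace_placeholders(string $html, array $lookup): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of emitting them
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Text nodes that contain a placeholder and do not sit anywhere inside a textarea or select.
    // Attribute values (such as an input's value) are not text nodes, so they are never selected.
    $query = "//text()[contains(., '{{') and not(ancestor::textarea or ancestor::select)]";

    foreach ($xpath->query($query) as $node) {
        $node->nodeValue = preg_replace_callback(
            '~\{\{([^:{}]+):([^{}]+)\}\}~',
            function (array $m) use ($lookup) {
                // Unknown placeholders are left as they are (a choice made for this sketch).
                return $lookup[$m[1]][$m[2]] ?? $m[0];
            },
            $node->nodeValue
        );
    }

    return $dom->saveHTML();
}

$html = <<<'HTML'
<html>
    <head>
        <title>System version: {{variable:system_version}}</title>
    </head>
    <body>
        <p>You are using system version {{variable:system_version}}</p>
        {{block:welcome}}
        <form>
            <input value="System version: {{variable:system_version}}">
            <textarea>
                You are using system version {{variable:system_version}}.
            </textarea>
        </form>
    </body>
</html>
HTML;

$lookup = [
    'variable' => ['system_version' => '6.0'],
    'block'    => ['welcome' => 'This was the content of the welcome block.'],
];

echo replace_placeholders($html, $lookup);
```

One thing to keep in mind: values assigned through nodeValue are inserted as text, so if a block's content contains markup it will most likely come out escaped, and inserting real elements would need a different step (for example building a fragment and replacing the text node).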

mickmackusa
  • Just to @Emma's defense (not that she needs it, I think) she addressed the problem I had, where the code I have was already using regexes. She did not suggest I use the regexes, it was me who had them. I don't want my lack of flexibility to harm her reputation. I may change it later, but for now, if Emma's new solution works, I will stick with it, as I am already way over deadline with what I needed to do there. Thanks for your comments, and your answers, Casimir and mickmackusa. – Kobus Myburgh Jul 27 '19 at 17:12
  • @mickmacusa, Casimir posted in his answer as well that his answer does not follow standards or something to that effect. I once again say thanks for your help, and I am going to investigate the DOM solution, as I am still running into occasional issues under certain circumstances. – Kobus Myburgh Jul 28 '19 at 07:17
  • Emma's solution now replaces some variables inconsistently if they are NOT contained in an input, and only in the body text of the page. I have decided to try the other solution that you (and Casimir) suggested, and if it works, I will move the checkbox. Emma did tell me that this is not the best solution as well, so I will provide more feedback once I have got your/Casimir's solution working. – Kobus Myburgh Jul 28 '19 at 07:22
  • I have implemented your solution, thank you. I have moved the accepted answer accordingly. It works perfectly, and I also benchmarked it, and it was significantly faster on an HTML string of 1MB. I had to rip out all my flawed code and there is big work ahead, but now the foundation is correct. Thanks again. – Kobus Myburgh Jul 28 '19 at 08:48

My guess is that you are likely designing an expression similar to:

<(?:textarea|select)[\s\S]*?({{variable:system_version}})[\s\S]*?<\/(?:textarea|select)>|<(?:input)[\s\S]*?({{variable:system_version}})[\s\S]*?>

which you would probably want to modify, and then replace the matches with whatever you like.

The expression is explained in the top-right panel of regex101.com if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.
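
As a side note on why the grouping matters, here is a small sketch comparing the pattern from the question with a grouped one. In PCRE the `|` alternation binds more loosely than everything around it, so the original pattern splits into five separate alternatives, the first of which is just `<textarea`. Any page that contains a textarea therefore matches, whether or not the placeholder is inside it. The `$page` string and the variable names are made up for the demonstration.

```php
<?php
$val = '{{variable:system_version}}';

// Pattern as posted in the question: five top-level alternatives,
// "<textarea", "<input", "<select…</textarea", "</input" and "</select>".
$ungrouped = '/<textarea|<input|<select(.+?)' . $val . '(.+?)<\/textarea|<\/input|<\/select\>/s';

// Grouped version: the alternation is confined to the tag names,
// and the placeholder is escaped with preg_quote().
$quoted  = preg_quote($val, '/');
$grouped = '/<(?:textarea|select)\b[^>]*>.*?' . $quoted . '.*?<\/(?:textarea|select)>'
         . '|<input\b[^>]*' . $quoted . '[^>]*>/s';

// The placeholder is outside the textarea here.
$page = '<p>' . $val . '</p><textarea>plain text</textarea>';

var_dump(preg_match($ungrouped, $page)); // int(1): the bare "<textarea" alternative matches
var_dump(preg_match($grouped, $page));   // int(0): no placeholder inside a form control
```

Even with the grouping fixed, a single preg_match over the whole page is all or nothing: if one occurrence of the placeholder sits inside a form control, the check blocks the replacement everywhere. The lazy `.*?` can also run past a closing tag into a later element, so the pattern may match when a placeholder merely sits between two textareas.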

Test

$re = '/<(?:textarea|select)[\s\S]*?({{variable:system_version}})[\s\S]*?<\/(?:textarea|select)>|<(?:input)[\s\S]*?({{variable:system_version}})[\s\S]*?>/m';
$str = '<html>
    <head>
        <title>System version: 6.0</title>
    </head>
    <body>
        <p>You are using system version 6.0</p>
        <div>
            This was the content of the welcome block.
        </div>
        <form>
            <input value="System version: {{variable:system_version}}">
            <textarea>
                You are using system version {{variable:system_version}}.
            </textarea>
        </form>
    </body>
</html>';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Edit for two steps:

<(?:textarea|select)[\s\S]*?>[\s\S]*?<\/(?:textarea|select)>|<(?:input)[\s\S]*?>

Demo 1

^<(?:input)[\s\S]*?({{variable:system_version}})[\s\S]*?>$

Demo 2

^<(?:input).*?({{variable:system_version}}).*?>$

Demo 3

Emma
  • Thanks for answering, Emma. Not sure what I am doing wrong. What I have now is this: `$re = '/<(?:textarea|select)[\s\S]*?(' . $val . ')[\s\S]*?<\/(?:textarea|select)>|<(?:input)[\s\S]*?(' . $val . ')[\s\S]*?>/m'; if (!preg_match($re, $data)) { $data = str_replace($val, $var, $data); }` This does not work as expected. $val is the placeholder. It is not always limited to variable:system_version. It could be variable:anything. – Kobus Myburgh Jul 22 '19 at 00:24
  • When I do `!preg_match`, nothing is replaced. When I do `preg_match`, all get replaced. I only want to replace when not in input, textarea or select. – Kobus Myburgh Jul 22 '19 at 00:31
  • I also looked at regex101: https://regex101.com/r/eM5bR8/3. It is not picking up the textarea. I then need to negate it because it only needs to replace when NOT matching input, select, textarea. – Kobus Myburgh Jul 22 '19 at 01:05
  • The latter issue looks like the /g modifier was not included. When including it, preg_match says unknown modifier. So can it only be used with preg_match_all then? It makes sense to me. – Kobus Myburgh Jul 22 '19 at 01:19
  • Hi Emma. I am updating the question with my current function. It does not replace anything, and I am not sure what I am doing wrong. Hope you can assist once more. – Kobus Myburgh Jul 25 '19 at 07:57
  • Hi again, Emma. I have found the reason why mine does not work. When an input, textarea or select does NOT contain the placeholder, the matches are broken. See: https://regex101.com/r/4AdaR7/2. Any idea why? – Kobus Myburgh Jul 25 '19 at 15:16
  • Thanks for all your help. I have accepted your answer, as it got me on the right track. I had to change my regex to this: `/(?:<(?:textarea|select)[\s\S]*?>[\s\S]*?)?({{variable:(.*?)}})[\s\S]*?(?:<\/(?:textarea|select)>)?|(?:<(?:input)[\s\S]*?)?{{variable:(.*?)}}(?:[\s\S]*?>)?/im` and then when looping through them, I checked with strpos if it contained the form element, and if so, I replaced `{{` temporarily with something else so that the string does not match, then did my replace, and replaced the temp replacement back to `{{` so that the correct string can display in my form element. – Kobus Myburgh Jul 25 '19 at 17:42
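
For anyone who wants to try the mask, replace, restore idea described in the last comment above, a minimal sketch might look like this. It is not the asker's actual code: the function name, the mask string, and the simplified pattern are all assumptions made for illustration. It is still regex-based, so it shares the usual fragility (an attribute value containing `>` or nested controls would confuse it).

```php
<?php
// Hypothetical helper illustrating mask -> replace -> restore; names are illustrative only.
function replace_outside_form_controls(string $data, string $val, string $var): string
{
    $mask = "\x01MASK\x01"; // assumed never to occur in $data

    // A whole textarea or select element, or a single input tag.
    $controls = '/<(textarea|select)\b[^>]*>.*?<\/\1>|<input\b[^>]*>/is';

    // 1. Hide the opening braces inside form controls so the placeholder no longer matches there.
    $masked = preg_replace_callback(
        $controls,
        function (array $m) use ($mask) {
            return str_replace('{{', $mask, $m[0]);
        },
        $data
    );
    if ($masked === null) {
        return $data; // PCRE error: leave the input untouched
    }

    // 2. Replace the placeholder everywhere it is still visible.
    $masked = str_replace($val, $var, $masked);

    // 3. Restore the hidden braces so the form controls show the original text.
    return str_replace($mask, '{{', $masked);
}

$data = '<title>v {{variable:system_version}}</title>'
      . '<input value="{{variable:system_version}}">'
      . '<textarea>{{variable:system_version}}</textarea>'
      . '<p>{{variable:system_version}}</p>';

echo replace_outside_form_controls($data, '{{variable:system_version}}', '6.0');
```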