
I am working on a PHP script for a custom CMS that will replace a custom tag with information from a database.

The tag looks like this:

<!-- NAV id="123" suffix="somethinghere" prefix="somethingelse" --> 

I need to pull out the id, suffix, and prefix attributes. The code below works well when the tag appears only once on the page, but if it appears more than once, or if "-->" occurs anywhere else on the page, it fails: it matches everything between the first "<!--" and the last "-->" instead of returning each match separately.

Here is my current code. If it were working properly, it would replace the entire tag with the value of "id"; eventually that will be data from the database.

<?php
global $lastNav, $html;

//the html content
$html = '<html><body><hr><br>Hi this is my content<br> <!-- NAV id="123" suffix="<br />" prefix="&bull;" --> <br>Some more content here <!-- NAV id="125" suffix="<br />" prefix="&bull;" -->     </body></html>';

$regexNavPattern = '<!-- NAV.*?(?:(?:\s+(id)="([^"]+)")|(?:\s+(prefix)="([^"]+)")|(?:\s+(suffix)="([^"]+)")|(?:\s+[^\s]+))+.*-->';

preg_replace_callback($regexNavPattern, "parseNav", $html);
function parseNav($navData) {
    global $lastNav, $html;

    foreach($navData as $key=>$value) {
        if($key == 0) { $lastNav['replace'] = '<'.$value.'>'; }
        if($value == 'id')     { $lastNav['id']     = $navData[$key+1]; }   
        if($value == 'prefix') { $lastNav['prefix'] = $navData[$key+1]; }   
        if($value == 'suffix') { $lastNav['suffix'] = $navData[$key+1]; }   
    }

    $html = str_replace($lastNav['replace'], $lastNav['id'], $html);
}

echo $html;
?>

At this point I am not concerned about case sensitivity. There is a chance that the attributes may contain special characters including single or double quotes.

Hopefully I explained this well enough. Thanks in advance.

Developer Gee
  • Your `.*` at the end should be changed to non greedy like `.*?`. – Jonathan Kuhn Nov 04 '14 at 23:04
  • I originally had the question mark in there at the end, and I still have the same problem. It doesn't seem to have any impact. This was my original expression – Developer Gee Nov 04 '14 at 23:05
  • Obligatory: [Regex isn't the right tool to parse html](http://stackoverflow.com/a/1732454/52598) – Lieven Keersmaekers Nov 04 '14 at 23:08
  • `[^\s]+` should be `[^\s]+?` because it will match everything including the closing `-->`. Plus, you are missing your delimiters. Just add a slash before and after the whole pattern and you should be fine. And lastly, if id, suffix and prefix all follow the same pattern of `XX="some quoted value"`, you could remove most of that with something like `\s+(id|prefix|suffix)="([^"]+)"` – Jonathan Kuhn Nov 04 '14 at 23:17
  • Lieven - The link suggests that I use an XML parser. I do not have control over the templates and cannot guarantee they will follow an XHTML structure. If the HTML is sloppy, an XML parser will not be able to handle it either. – Developer Gee Nov 04 '14 at 23:17
  • Jonathan I appreciate the help so far. I am terribly weak with regular expressions so I need a bit more help. I like the simpler approach, and yes they will all follow that pattern. Can you elaborate on the \s+(id|prefix|suffix)="([^"]+)" How would I merge that with the search for – Developer Gee Nov 04 '14 at 23:24
  • So that optimization I talked about won't work. It lacks groups to capture all the data or else you will just get the last one because it will overwrite. Your pattern probably would work fine though with the other fixes. – Jonathan Kuhn Nov 04 '14 at 23:42
  • IMO, this would be better broken out over a few functions like one that does `preg_match_all('//', $html, $matches);` and returns `$matches` which would be an array of the html comment tags. Then loop over that return array and call another function that tries to match up any attributes like `preg_match_all('/\w+="[^"]+"/', $tag, $matches);` and build/returns an array of `key=>value` for anything found. Then you can use that return array to find id, prefix, suffix and any other attributes that were set. It would make for much cleaner code than a single regex. – Jonathan Kuhn Nov 04 '14 at 23:55
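
The greedy-versus-lazy point raised in the first comment can be seen in isolation. This is only a minimal sketch on a made-up `$sample` string, using a much simpler pattern than the one in the question, so it illustrates the quantifier behaviour and nothing more:

```php
<?php
// Hypothetical sample input: two NAV comment tags on a single line.
$sample = 'a <!-- NAV id="1" --> b <!-- NAV id="2" --> c';

// Greedy: .* runs to the LAST "-->", so both tags come back as one match.
preg_match_all('/<!-- NAV.*-->/', $sample, $greedy);

// Lazy: .*? stops at the FIRST "-->" after each opening, one match per tag.
preg_match_all('/<!-- NAV.*?-->/', $sample, $lazy);

echo count($greedy[0]), "\n"; // 1
echo count($lazy[0]), "\n";   // 2
```

Both patterns are wrapped in `/` delimiters, which is the other fix mentioned in the comments.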

1 Answer


Jonathan Kuhn's solutions worked. For the time being I went with the first approach of just correcting the existing regex.

/<!-- NAV.*?(?:(?:\s+(id)="([^"]+)")|(?:\s+(prefix)="([^"]+)")|(?:\s+(suffix)="([^"]+)")|(?:\s+[^\s]?+))+.*?-->/

Later I will modify it to break it down to work as a few functions. I appreciate the help.
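
A rough sketch of what that function-based version might look like, following the approach suggested in the comments. The names `parseNavAttributes` and `replaceNavTags` are hypothetical, and the sketch assumes every attribute is written as `name="value"` with no double quote inside the value and that no value contains `-->`. Returning the id stands in for the eventual database lookup.

```php
<?php
// Hypothetical helper: turn the inside of one NAV tag into name => value pairs.
// Assumes name="value" attributes with no double quote inside the value.
function parseNavAttributes(string $tagBody): array
{
    $attributes = [];
    preg_match_all('/(\w+)="([^"]*)"/', $tagBody, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[$pair[1]] = $pair[2]; // $pair[1] is the name, $pair[2] the value
    }
    return $attributes;
}

// Hypothetical helper: replace each NAV tag with its id attribute.
// The lazy (.*?) stops at the first "-->" so each tag is handled on its own.
function replaceNavTags(string $html): string
{
    $result = preg_replace_callback(
        '/<!-- NAV\b(.*?)-->/s',
        function (array $match): string {
            $attributes = parseNavAttributes($match[1]);
            return $attributes['id'] ?? ''; // tags without an id are removed
        },
        $html
    );
    return $result ?? $html; // fall back to the input if the regex engine errors
}

$html = '<html><body><hr><br>Hi this is my content<br> <!-- NAV id="123" suffix="<br />" prefix="&bull;" --> <br>Some more content here <!-- NAV id="125" suffix="<br />" prefix="&bull;" -->     </body></html>';

echo replaceNavTags($html);
```

Because the callback returns the replacement and `preg_replace_callback` returns the new string, no globals or extra `str_replace` call are needed. Attribute values that contain double quotes would need a more permissive attribute pattern than the one assumed here.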

Developer Gee