preg_replace "gotcha" with replacement value escaping

Question

I'm running preg_replace on content that I don't necessarily control, and I'm running into an issue with replacement values that contain things like currency amounts (e.g. $1.00). Admittedly, this is a common problem that has been addressed in other questions. The closest solution I've found is:

http://www.procata.com/blog/archives/2005/11/13/two-preg_replace-escaping-gotchas/

My problem is more complicated because the replacement value is not something I can escape ahead of time, at least not in a way I can see. Here's my preg code:

$body = preg_replace('/<special_tag id="'.$tagID.'">(.*?)<\/special_tag>/','$1',$body);

As you can see, I'm capturing all content within a custom tag and removing the surrounding opening and closing tags while keeping the content found inside. The replacement '$1', however, doesn't lend itself to the escaping that is required, so currency values that happen to be in the replacement values are getting terminated incorrectly.
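For reference, one way to sidestep replacement-string escaping for this kind of tag stripping is preg_replace_callback, whose callback return value is used as the replacement verbatim, without any `$n` or `\n` reference parsing. The following is only a minimal sketch: the sample values for `$tagID` and `$body` are made up, and the `preg_quote()` call and the `s` modifier (so the content may span several lines) are additions that are not in the original pattern.

```php
<?php
// Hypothetical sample values standing in for the variables in the question.
$tagID = 'foo';
$body  = 'Before <special_tag id="foo">Price: $1.00 (was \\2)</special_tag> after';

// preg_quote() keeps regex metacharacters in the ID from changing the pattern.
$pattern = '/<special_tag id="' . preg_quote($tagID, '/') . '">(.*?)<\/special_tag>/s';

$body = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // Return the captured content as-is; it is not scanned for $1 or \1.
        return $m[1];
    },
    $body
);

echo $body, PHP_EOL;
```

With the sample input above, the dollar sign and the backslash in the tag content come through unchanged.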

Have I overthought this replacement? Is there something else I can use to remove my special tags, keeping in mind that it must take into account the unique ID of that specific tag?

Any help would be greatly appreciated!

"Is there something else I can use to remove my special tags": uuuh, [how about a DOM parser](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662)? — Gordon, Jan 03 '13 at 16:07
I actually used that method in another part of the same script but it seemed overkill for this particular replacement, just trying to keep overhead down if I can, but if this is the only option, I will fall back on it. — oucil, Jan 03 '13 at 16:15
it's not the *only* option, but it comes to (my) mind before approaching this with Regex. — Gordon, Jan 03 '13 at 16:17
@Gordon Seems the DOM is the preferred method, thanks for your input! I imagine I can rewrite my current stuff to use it more efficiently :) Since there are no other answers, if you're willing, can you create one with the same link for others, and I'll accept it. — oucil, Jan 04 '13 at 14:00
procata.com's solution is incomplete. The string $replacement = '$10+\\5' will not be handled correctly. — zylstra, Oct 29 '13 at 01:27
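
To illustrate the case raised in the last comment: when a literal string has to be passed as the replacement argument of preg_replace, both backslashes and dollar signs need a backslash in front of them. The helper below is a sketch under that assumption; `escape_replacement` is a hypothetical name, not a built-in PHP function, and the `PRICE` placeholder is invented for the demo.

```php
<?php
// Hypothetical helper: make a literal string safe to use as a
// preg_replace replacement by escaping "\" and "$".
function escape_replacement(string $text): string
{
    // strtr() with an array replaces both characters in a single pass,
    // so the backslashes it adds are not escaped a second time.
    return strtr($text, ['\\' => '\\\\', '$' => '\\$']);
}

$replacement = '$10+\\5'; // the string from the comment, i.e. $10+\5
$result = preg_replace('/PRICE/', escape_replacement($replacement), 'Total: PRICE');

echo $result, PHP_EOL;
```

Without the helper, `$10` and `\5` would be read as references to capture groups that do not exist in this pattern and would be replaced with empty strings.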

score 1 · Accepted Answer · answered Jan 04 '13 at 14:40

Possible DOM solution that shouldn't have any of the "gotchas".

Assuming this HTML:

$html = <<< HTML
<html>
    <body>
        <special_tag id="foo">
            <p>Some content</p>
            <p>Some more content</p>
        </special_tag>
    </body>
</html>
HTML;

You pull up the children of special_tag and remove special_tag afterwards:

// create DOMDocument, suppress parsing errors
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// get special_tag with id foo
$xpath = new DOMXPath($dom);
$foo = $xpath->query('//special_tag[@id="foo"]')->item(0);

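// move all children before special_tag
// (the second argument is the reference node; without it the child would be appended to the end of the parent)
while ($foo->childNodes->length > 0) {
    $foo->parentNode->insertBefore($foo->childNodes->item(0), $foo);
}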

// remove now empty special_tag
$foo->parentNode->removeChild($foo);

// output
echo $dom->saveHTML($dom->documentElement);

This will result in something like:

<html><body>
    <p>Some content</p>
        <p>Some more content</p>
    </body></html>
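
The steps above could also be wrapped in a reusable function that takes the tag ID as a parameter and copes with zero or several matching tags. This is a sketch, not part of the original answer: `unwrap_special_tag` is a hypothetical name, and comparing the id in PHP instead of interpolating it into the XPath expression is a design choice made here to avoid quoting problems.

```php
<?php
// Hypothetical helper: remove every <special_tag> with the given id,
// keeping its children in place.
function unwrap_special_tag(string $html, string $tagId): string
{
    // create DOMDocument, suppress parsing errors for the unknown tag
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Copy the matches into an array first, because the tree changes while looping.
    $tags = iterator_to_array($xpath->query('//special_tag'));

    foreach ($tags as $tag) {
        if ($tag->getAttribute('id') !== $tagId) {
            continue;
        }
        // move all children in front of the tag, then drop the empty tag
        while ($tag->firstChild !== null) {
            $tag->parentNode->insertBefore($tag->firstChild, $tag);
        }
        $tag->parentNode->removeChild($tag);
    }

    return $dom->saveHTML($dom->documentElement);
}

// Usage with made-up sample markup containing a currency value.
$sample = '<html><body>'
    . '<special_tag id="foo"><p>Costs $1.00</p></special_tag>'
    . '<special_tag id="bar"><p>Keep</p></special_tag>'
    . '</body></html>';

$out = unwrap_special_tag($sample, 'foo');
echo $out, PHP_EOL;
```

Only the tag with id "foo" is unwrapped here; the "bar" tag is left alone, and the currency value is never treated as a replacement reference because no regex replacement is involved.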

Thanks for the sample code as well, I'm sure others will appreciate it, and I'll also say, it's probably a much smarter move heading towards DOM than relying as heavily as I have on preg functions. Cheers! — oucil, Jan 04 '13 at 18:23

score 0 · Answer 2 · edited May 23 '17 at 12:12

Using Regex to parse XML/HTML is not recommended. Use a DOM parser instead.

answered Jan 04 '13 at 14:39 by Madara's Ghost (edited by Community)
