1

I have a string which happens to be HTML, and I wish to delete specific sections of it server-side using PHP (no JavaScript/jQuery solutions please). The string needs to contain identifiers which tag the sections that might be removed, and I will also have a variable which indicates which tagged sections should be removed. These indicator tags should not remain in the final modified string.

For instance, consider $html_1, where I added a capture attribute to tag the sections which might be deleted, or $html_2, where I wrapped [caption id="..."]...[/caption] around the elements which might be deleted. Note that these are just two possible ways I thought of for tagging the sections, and I am okay with any other method which allows the string to be stored in a DB.

In both, I have an <h1> block, an <h2> block, and a <p> block tagged as sections which may or may not be removed. Given $modify, which indicates which sections should be removed (true) or kept (false), how can I generate the new string which is equal to $html_new? I am thinking a DOMDocument, str_replace, or regex solution might work, but I am not sure.

<?php

$html_1 = <<<EOT
<div>
    <div>
        <div>
            <h1 capture="a">bla bla bla</h1>
            <p>bla</p>
            <h2 capture="b">bla bla<span>bla</span></h2>
            <h1>bla bla bla bla</h1>
        </div>
    </div>
    <div>
        <p capture="c">bla bla bla</p>
        <h1>bla bla</h1>
    </div>
</div>
EOT;

$html_2 = <<<EOT
<div>
    <div>
        <div>
            [caption id="a"]<h1>bla bla bla</h1>[/caption]
            <p>bla</p>
            [caption id="b"]<h2>bla bla<span>bla</span></h2>[/caption]
            <h1>bla bla bla bla</h1>
        </div>
    </div>
    <div>
        [caption id="c"]<p>bla bla bla</p>[/caption]
        <h1>bla bla</h1>
    </div>
</div>
EOT;

$modify=array('a'=>true,'b'=>false,'c'=>true);

$html_new = <<<EOT
<div>
    <div>
        <div>
            <p>bla</p>
            <h2>bla bla<span>bla</span></h2>
            <h1>bla bla bla bla</h1>
        </div>
    </div>
    <div>
        <h1>bla bla</h1>
    </div>
</div>
EOT;
?>
user1032531
  • Have you tried anything yourself? Looks like a pretty simple regex pattern to me. – ksbg Jun 02 '15 at 13:44
  • @treegarden I am pretty weak with regex. My difficulty would be differentiating between the `a`, `b`, and `c` tag. I was probably going to go down the `DOMdocument` solution, but maybe that isn't the right way to go. – user1032531 Jun 02 '15 at 13:46
  • HTML with regex? See [here](http://stackoverflow.com/a/1732454/1864610). DOMdocument is exactly the way to go. –  Jun 02 '15 at 13:46
  • @HoboSapiens A little melodramatic, but fun post! I still feel regex works with very defined cases, but am not claiming it should be used for my current need. Thanks! – user1032531 Jun 02 '15 at 13:53
  • @HoboSapiens http://meta.stackoverflow.com/questions/261561 – Sebastian Simon Jun 02 '15 at 13:54

2 Answers

1

I used $html_2 because it seemed easier to work with. This should do the trick:

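foreach ($modify as $letter => $remove) {
    // preg_quote() keeps an id containing regex metacharacters from breaking the pattern.
    // The U modifier makes (.*) lazy, so each match stops at the first [/caption].
    $pattern = '/\[caption id="' . preg_quote($letter, '/') . '"\](.*)\[\/caption\]/U';
    $replace = $remove ? '' : '$1'; // drop the block, or keep only its contents
    $html_2 = preg_replace($pattern, $replace, $html_2);
}
$html_2 = preg_replace('/^\h*\v+/m', '', $html_2); // Optional: remove empty lines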

If $remove is false for a given letter, the matched part of the string gets replaced with the first capture group (which is everything between the caption tags). If it is true, the match gets replaced with an empty string.
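The replacement logic described above can be sketched as a self-contained function. The function name `apply_captions` and the sample string are illustrative assumptions, not part of the original answer. This variant also assumes two things the loop above does not handle: a tagged block may span several lines (hence the `s` modifier), and caption tags whose id is missing from `$modify` should still be stripped while their content is kept.

```php
<?php
// Hypothetical helper: removes or unwraps [caption id="..."]...[/caption] blocks.
function apply_captions(string $html, array $modify): string
{
    foreach ($modify as $id => $remove) {
        // (.*?) is lazy, and the s modifier lets '.' match newlines as well.
        $pattern = '/\[caption id="' . preg_quote((string) $id, '/') . '"\](.*?)\[\/caption\]/s';
        $html = preg_replace($pattern, $remove ? '' : '$1', $html);
    }
    // Assumption: any caption tags left over (ids not listed in $modify)
    // are stripped, keeping the content they wrapped.
    return preg_replace('/\[\/?caption[^\]]*\]/', '', $html);
}

$sample = '[caption id="a"]<h1>x</h1>[/caption]<p>y</p>'
        . '[caption id="b"]<h2>z</h2>[/caption]'
        . '[caption id="c"]<p>w</p>[/caption]';

$result = apply_captions($sample, ['a' => true, 'b' => false]);
echo $result, PHP_EOL; // <p>y</p><h2>z</h2><p>w</p>
```

Whether leftover tags should be stripped or left alone depends on how `$modify` is built, so treat the final `preg_replace` as optional.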

ksbg
  • Given the very distinctive delimiter `[caption...`, I would expect that this won't corrupt the HTML. Agree? – user1032531 Jun 02 '15 at 14:04
  • Well, all `[caption...]` tags will get removed on the server side before it's sent to the client and the HTML is rendered, so you don't need to worry about that :) – ksbg Jun 02 '15 at 14:08
  • And no, given the tags' uniqueness you also don't need to worry about the regex messing up anything else. – ksbg Jun 02 '15 at 14:11
0

You could use preg_replace to replace any line containing capture="a" with a blank line (this assumes each tagged element sits on a single line), like this:

$stripped = preg_replace('/^.*capture="a".*$/m', '', $html_1);

If you encased this in a function, you could pass an argument to strip out a, b, or c:

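function strip($capture, $block) {
    // The m modifier lets ^ and $ match at every line, not only at the ends of the string.
    $pattern = '/^.*capture="' . preg_quote($capture, '/') . '".*$/m';
    return preg_replace($pattern, '', $block);
}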
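The line-based pattern above only works when each tagged element sits on one line, and it leaves the capture attribute on elements that are kept. As an alternative, here is a minimal sketch of the DOMDocument route suggested in the comments. The function name `strip_captured`, the wrapper id `__root`, and the shortened sample are assumptions made for illustration.

```php
<?php
// Hypothetical helper: removes elements whose capture id is flagged true in
// $modify and strips the capture attribute from every tagged element it keeps.
function strip_captured(string $html, array $modify): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // Wrap the fragment in a known root so it can be extracted again later.
    // Non-ASCII input may need an explicit encoding hint; omitted here.
    $doc->loadHTML('<div id="__root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array before modifying the tree.
    foreach (iterator_to_array($xpath->query('//*[@capture]')) as $node) {
        $id = $node->getAttribute('capture');
        if (!empty($modify[$id])) {
            $node->parentNode->removeChild($node); // flagged: drop the element
        } else {
            $node->removeAttribute('capture');     // kept: drop only the marker
        }
    }

    // Serialize only the children of the wrapper, not the html/body shell.
    $root = $xpath->query('//*[@id="__root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = <<<EOT
<div>
    <h1 capture="a">bla bla bla</h1>
    <h2 capture="b">bla bla<span>bla</span></h2>
    <p capture="c">bla bla bla</p>
    <h1>bla bla</h1>
</div>
EOT;

$result = strip_captured($html, ['a' => true, 'b' => false, 'c' => true]);
echo $result, PHP_EOL;
```

Removed elements leave their surrounding whitespace behind, so the empty-line cleanup from the other answer may still be useful afterwards.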