
I'm trying to remove an HTML element from a string.

I have the following preg_replace call:

    $body = preg_replace('#<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">(.*?)</div>#', '', $body);

But the preg_replace doesn't seem to work.

Here is the full code:

    $html = new DOMDocument();
    @$html->loadHTMLFile($url);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query('//*[@class="coincodex-content"]');
    $body = '';
    foreach ($nodelist as $n) {
        $body .= $html->saveHTML($n) . "\n";
    }

    $body = preg_replace('#<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">(.*?)</div>#', '', $body);

The current output is this:

    <div class="coincodex-content">
    hello this is content
    <div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
    <div><center><span style="font-size:11px; color: gray;"TEST</span></center>
    <b>TEST</b><br><br></div></div>
    <div class="rp4wp-related-posts rp4wp-related-post">
        </ul></div><!-- AI CONTENT END 1 -->
    <div class="entry-tags" style="margin-bottom:15px; font-weight: bold; text-align:center;">Tags: <a href="#" rel="tag">test</a> <a href="#" rel="tag">#tag</a></div>
    </div>

And my desired output is:

    <div class="coincodex-content">
    hello this is content
    </div>

I'd really appreciate any help. I'm sure there is an easier way to achieve this; I'm just not entirely sure why my current method is not working. Thank you.

Lewis
  • What do you start with? Likely would be best off not using a regex and only parse the data you want. – user3783243 Mar 28 '22 at 23:48
  • Use DOM methods to locate and remove the `<div>`. Also, read [this cautionary tale of eldritch horrors](https://stackoverflow.com/a/1732454/283366) – Phil Mar 28 '22 at 23:49
  • So you only want first textnode of `coincodex-content`? – user3783243 Mar 28 '22 at 23:50
  • Problem 1: you have multiple `</div>` tags in your code, so your regex `.*` would not extend to the last `</div>` in your input. Problem 2: why the `#`? – Nic3500 Mar 28 '22 at 23:57
  • @Nic3500 You can use almost any non-alphanumeric character for the regex delimiter in PHP, so it's often handy to pick one that won't appear in the pattern to avoid extra escaping – Phil Mar 28 '22 at 23:58

2 Answers


Regular expressions are unsuitable for modifying DOM elements, and your experiment shows that: the result is wrong and is also invalid HTML.
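To make that concrete, here is a minimal sketch of the two ways the pattern from the question can go wrong. The input is a shortened, hypothetical copy of the question's markup, and `$open`, `$noFlag` and `$withFlag` are names I made up for the illustration.

```php
<?php
// Hypothetical sample input, shortened from the output shown in the question.
$body = <<<HTML
<div class="coincodex-content">
hello this is content
<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
<div><b>TEST</b></div></div>
<div class="entry-tags">Tags</div>
</div>
HTML;

// The literal opening tag from the question, escaped for use inside #...#.
$open = preg_quote('<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">', '#');

// 1) Without the s modifier "." does not match a newline, and a newline
//    follows the opening tag, so the pattern never matches.
$noFlag = preg_replace('#' . $open . '(.*?)</div>#', '', $body);
var_dump($noFlag === $body); // bool(true): nothing was replaced

// 2) With the s modifier the lazy (.*?) stops at the FIRST </div>,
//    which closes the inner <div>, not the code-block <div>.
$withFlag = preg_replace('#' . $open . '(.*?)</div>#s', '', $body);
echo $withFlag, "\n";

// The tags no longer balance: one closing tag is left over.
var_dump(substr_count($withFlag, '<div'));   // int(2)
var_dump(substr_count($withFlag, '</div>')); // int(3)
```

So the pattern either matches nothing or cuts the block off too early and leaves a stray closing tag behind.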

It is better to use DOM methods to solve the problem, as noted in the comments. DOM provides the method `DOMNode::removeChild`, which you can use to remove elements. To show how `removeChild` can be used, I chose simpler HTML.

    $html = <<<HTML
    <div>
    <div class="coincodex-content">
    hello this is content
      <div class="delete_this" style="margin: 8px 0; clear: both;">
        <div>
           <center><span style="font-size:11px; color: gray;">TEST</span></center>
           <b>TEST</b><br><br>
         </div>
      </div>
      <div class="preserved">
        Test2
      </div>
    </div>
    </div>
    HTML;

I collect the fragments into an array.

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodelist = $xpath->query('//*[@class="coincodex-content"]');

    $fragment = [];
    foreach ($nodelist as $contentNode) {
        // The leading "." keeps the search inside $contentNode;
        // a bare "//" would search the whole document.
        $removeNodelist = $xpath->query('.//div[@class="delete_this"]', $contentNode);
        $item = $removeNodelist->item(0); // only the first match
        if ($item !== null) {
            $item->parentNode->removeChild($item);
        }
        $fragment[] = $doc->saveHTML($contentNode);
    }

The result in `$fragment[0]`:

    <div class="coincodex-content">
    hello this is content

      <div class="preserved">
        Test2
      </div>
    </div>
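In the question's markup the class attribute holds two class names (`code-block code-block-12`), so an exact `@class="..."` comparison can be brittle. The sketch below is one way to handle that, not the only one. It assumes a shortened, hypothetical copy of the question's HTML, matches `code-block` as a whole class token, and removes every match rather than only the first.

```php
<?php
// Hypothetical input modelled on the markup from the question.
$html = <<<HTML
<div class="coincodex-content">
hello this is content
<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
<div><b>TEST</b><br><br></div></div>
<div class="entry-tags">Tags: <a href="#" rel="tag">test</a></div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$content = $xpath->query('//*[@class="coincodex-content"]')->item(0);

// Pad the class attribute with spaces so that " code-block " only matches
// a complete class name, not a substring such as "code-block-12".
$query = './/div[contains(concat(" ", normalize-space(@class), " "), " code-block ")]';

// Copy the matches into an array before removing anything.
foreach (iterator_to_array($xpath->query($query, $content)) as $node) {
    $node->parentNode->removeChild($node);
}

$result = $doc->saveHTML($content);
echo $result, "\n";
```

The `entry-tags` block is left in place here; it could be removed the same way with a second query.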

Try it yourself at 3v4l.org.

jspit

This is cheating a bit. The main problem with trying to use regex to parse HTML is nested tags, which will drive you to madness. If you truly only need to keep the first <div> and the content that occurs before the second <div>, the following will work.

    // U makes the quantifiers lazy by default; s lets "." match newlines.
    if (preg_match('#<div class="coincodex-content">(.*)<div.*$#Us', $body, $matches)) {
        $body = '<div class="coincodex-content">' . $matches[1] . '</div>';
    }

This works because we're only extracting the content we need and inserting it into a wrapper whose format is static.
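If only the text directly inside the wrapper is wanted, roughly the same result can be sketched with DOM methods instead of a pattern. This is an alternative technique and not the regex approach above. The input is a shortened, hypothetical copy of the question's HTML, and every child of the wrapper that is not a text node gets removed.

```php
<?php
// Hypothetical input modelled on the markup from the question.
$html = <<<HTML
<div class="coincodex-content">
hello this is content
<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
<div><b>TEST</b><br><br></div></div>
<!-- AI CONTENT END 1 -->
<div class="entry-tags">Tags: <a href="#" rel="tag">test</a></div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$content = $xpath->query('//*[@class="coincodex-content"]')->item(0);

// childNodes is a live list, so take a snapshot before removing nodes.
foreach (iterator_to_array($content->childNodes) as $child) {
    if ($child->nodeType !== XML_TEXT_NODE) {
        $content->removeChild($child); // drops nested elements and comments
    }
}

$result = $doc->saveHTML($content);
echo $result, "\n";
```

Whitespace-only text nodes between the removed elements stay behind, so the output may contain a few empty lines before the closing tag.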


FoulFoot