You can use a regular expression to remove the first occurrence of </body></html> when the same pair of tags appears again later in the document:
// https://regex101.com/r/nVuN8S/1
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';
$new_html = preg_replace($regex, '', $html);
The pattern matches </body> and </html> separated by any amount of whitespace (e.g. a newline). A positive lookahead then checks that they are followed by any number of characters, including whitespace, and then by another </body> and </html> pair. Because preg_replace replaces every match, each pair except the last one is removed.
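To see the replacement in action, here is a minimal sketch on a made-up input; the sample string is an assumption and not taken from the question. It also swaps (?:.|\s) for a plain dot with the s (DOTALL) modifier, which should match the same text with less backtracking:

```php
<?php
// Hypothetical sample: two documents glued together, so the closing pair appears twice.
$html = "<html><body><p>first</p></body>\n</html>\n<p>second</p></body></html>";

// Same idea as the pattern above; the s modifier lets "." match newlines as well.
$regex = '/<\/body>\s*<\/html>(?=.*<\/body>\s*<\/html>)/s';
$new_html = preg_replace($regex, '', $html);

// Only the final closing pair is left in place.
echo $new_html;
```

On very large inputs the lookahead may run into PCRE's backtracking limit, in which case preg_replace returns null, so checking the return value is worthwhile.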
To read "all the data" (assuming that means everything between the <body> tags), you can use another regex, e.g.:
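// https://regex101.com/r/nVuN8S/2
$regex = '/<body>(?<data>(?:.|\s)+)<\/body>/';
preg_match($regex, $html, $matches);
$data = $matches['data'] ?? null; // null if no <body>...</body> was found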
Of course, there are other ways to get the data: simple string manipulation (remove the text before <body> and after </body>, along with the tags themselves), DOM document functionality, etc.
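The two alternatives could look roughly like the sketch below. The sample input and the variable names are assumptions for illustration. The string version assumes a bare <body> tag without attributes, as the regex does, and the DOM version assumes the dom extension is enabled:

```php
<?php
// Hypothetical sample input; in practice $html is the document you already have.
$html = "<html><body>\n<p>Hello</p>\n</body></html>";

// 1. String manipulation: keep what lies between the first <body> and the last </body>.
$start = strpos($html, '<body>');
$end   = strrpos($html, '</body>');
$data  = null;
if ($start !== false && $end !== false && $end > $start) {
    $start += strlen('<body>');
    $data = substr($html, $start, $end - $start);
}

// 2. DOMDocument: let the parser locate <body>, then serialize its child nodes.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$inner = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}
```

The string version returns the text exactly as it appears in the input, while the DOM version returns the parser's normalized markup, which can differ for malformed HTML.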