How can I read all the data with a crawler from a page that has two html tags, for example:

<html>
<body>
text text text
</body>
</html>



text2 text2 text2 text
</body>
</html>

I need to remove the first closing html and body tags and then read all the data. How do I do that?

Yury Fedorov
Arthur

1 Answer

You can use a regular expression to remove the first occurrence of </body></html>, provided another pair of the same tags follows it:

// https://regex101.com/r/nVuN8S/1
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';
$new_html = preg_replace($regex, '', $html);

Here you look for </body> and </html> separated by any number of whitespace characters (e.g. a newline). The positive lookahead then checks that they are followed by any number of characters, including whitespace, and then by another </body> and </html> pair. Because a lookahead does not consume text, only the first pair is removed.
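As a rough illustration of the replacement, here is a self-contained sketch. The sample string is an assumption modelled on the page in the question, and the pattern is a variant of the one above that uses the s modifier (so that "." also matches newlines) in place of (?:.|\s)*; it is not the exact pattern from the regex101 link.

```php
<?php
// Assumed sample input: one opening pair of tags, two closing pairs.
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\n\n"
      . "text2 text2 text2 text\n</body>\n</html>\n";

// With the "s" modifier, ".*" spans newlines, so the lookahead can
// search the rest of the document for another closing pair.
$regex = '/<\/body>\s*<\/html>(?=.*<\/body>\s*<\/html>)/s';

// Remove each closing pair that is still followed by another closing pair.
$new_html = preg_replace($regex, '', $html);

echo $new_html;
```

With two closing pairs in the input, only the last pair should remain, and both blocks of text stay in place.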

To read "all the data" (assuming that means everything between the <body> tags), you can use another regex, e.g.:

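// https://regex101.com/r/nVuN8S/2
$regex = '/<body>(?<data>(?:.|\s)+)<\/body>/';
// On a match, the captured text is available as $matches['data']
preg_match($regex, $new_html, $matches);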

Of course, you may use a couple of different approaches to get the data: simple string manipulation (remove text before <body> and after </body>, and the tags themselves), DOM document functionality, etc.
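The simple string manipulation approach could look roughly like the sketch below. The function name extract_body_text is hypothetical, and the code assumes a plain <body> tag without attributes; treat it as one possible way to do it rather than a definitive implementation.

```php
<?php
// Hypothetical helper: returns the text between the first "<body>" and the
// last "</body>", with any stray closing tags in between removed.
function extract_body_text(string $html): ?string
{
    $start = stripos($html, '<body>');
    $end = strripos($html, '</body>');
    if ($start === false || $end === false || $end < $start) {
        return null; // no usable <body> ... </body> span
    }
    $start += strlen('<body>');
    $inner = substr($html, $start, $end - $start);
    // Drop closing tags left in the middle of the content.
    $inner = str_ireplace(['</body>', '</html>'], '', $inner);
    return trim($inner);
}

// Assumed sample input modelled on the page in the question.
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\n\n"
      . "text2 text2 text2 text\n</body>\n</html>\n";
$data = extract_body_text($html);
```

This avoids regular expressions entirely, at the cost of not handling <body> tags that carry attributes.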

Yury Fedorov