Php crawler reading all data from 2 htmls

Question

How can I read all data with a crawler from a page that has 2 html tags, for example:

<html>
<body>
text text text
</body>
</html>



text2 text2 text2 text
</body>
</html>

I need to remove the first closing body and html tags, and then read all the data. How do I do that?

Possible duplicate of [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) — Beloo, Dec 22 '16 at 08:31

score 0 · Answer 1 · answered Dec 22 '16 at 10:22

You can use a regular expression to remove the first occurrence of </body></html>, provided another pair of the same tags follows it:

// https://regex101.com/r/nVuN8S/1
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';
$new_html = preg_replace($regex, '', $html);

Here you look for </body> and </html> separated by any number of whitespace characters (e.g. a newline). Then you use a positive lookahead to check that they are followed by any number of characters, including whitespace, and then by another pair of </body> and </html> tags.
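To see how the pattern behaves, here is a minimal sketch that runs it on input modeled on the question. The sample string in $html is an assumption made for this demo; the pattern itself is the one shown above.

```php
<?php
// Assumed sample input: the first document is closed, then more text
// and a second closing pair follow (as in the question).
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\n\n"
      . "text2 text2 text2 text\n</body>\n</html>\n";

// Pattern from the answer: a closing pair that is followed, somewhere
// later in the string, by another closing pair.
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';

// Replace every such pair with an empty string. The last pair has no
// later pair after it, so the lookahead fails and it is kept.
$new_html = preg_replace($regex, '', $html);

echo $new_html;
```

With this input only the final </body></html> pair should remain. Note that a page with three or more closing pairs would lose all of them except the last, because each earlier pair also satisfies the lookahead.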

To read "all the data" (assuming that it means everything between the <body> tags), you can use another regex, e.g.:

// https://regex101.com/r/nVuN8S/2
$regex = '/<body>(?<data>(?:.|\s)+)<\/body>/';
preg_match($regex, $new_html, $matches); // $matches['data'] holds the body content

Of course, you may use a couple of different approaches to get the data: simple string manipulation (remove text before <body> and after </body>, and the tags themselves), DOM document functionality, etc.
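The simple string manipulation route mentioned above could look roughly like this. It is only a sketch under some assumptions: the sample input is made up, the <body> tag is assumed to be lowercase and to carry no attributes, and the variable names are illustrative.

```php
<?php
// Assumed sample input modeled on the question.
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\n\n"
      . "text2 text2 text2 text\n</body>\n</html>\n";

$open  = '<body>';
$start = strpos($html, $open);        // first opening <body>
$end   = strrpos($html, '</body>');   // last closing </body>

$data = '';
if ($start !== false && $end !== false && $end > $start) {
    // Keep only what lies between the first <body> and the last </body>.
    $inner = substr($html, $start + strlen($open), $end - $start - strlen($open));
    // Drop the closing tags left over from the first document.
    $data = trim(str_replace(['</body>', '</html>'], '', $inner));
}

echo $data;
```

This avoids regular expressions entirely, at the cost of being strict about how the tags are written. For real-world pages a DOM parser is usually the more robust choice.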
