Consider a document in the following format:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
I am loading a document like this from one domain to another with PHP cURL. I would like to trim the cURL result so that it includes only div.blog_post_item.first and its children. I know the structure of the other page, but I can't edit it. I imagine I can use preg_match to find the opening and closing tags; they will always look the same, including that ending comment.
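To make the preg_match idea concrete, here is a minimal sketch of what I have in mind. It assumes the opening tag and the closing tag with its trailing comment always appear exactly as shown above; $html is a hypothetical stand-in for the fetched page, and $pattern and $fragment are names made up for illustration.

```php
<?php
// Hypothetical stand-in for the fetched page; in practice this would be the cURL result.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="blog_post_item first">
<p>Hello</p>
</div><!-- end blog_post_item -->
</body>
</html>
HTML;

// Lazily match from the known opening tag up to the closing tag plus its
// trailing comment. The "s" modifier lets "." match newlines as well.
$pattern = '~<div class="blog_post_item first">.*?</div><!-- end blog_post_item -->~s';

// preg_match() returns 1 on a match; $m[0] then holds the whole matched element.
$fragment = preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
```

Because the match is lazy and ends on the closing comment, nested div elements inside the post should not end the match early, but any change to the markup's whitespace or attribute order would break it.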
I have searched for examples and tutorials on screen scraping with cURL, XPath, XSLT, and so on, and what I find is mostly a cyclical rattling off of the names of HTML parsing libraries. For that reason, please provide a simple working example. Please do not simply explain that parsing HTML with regex is a potential security vulnerability. Please do not just list libraries and specifications that I should read further into.
I have some simple PHP cURL code:
$ch = curl_init("http://a.web.page.com");
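curl_setopt($ch, CURLOPT_HEADER, 0);
// Without CURLOPT_RETURNTRANSFER, curl_exec() prints the page and returns true.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);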
curl_close($ch);
Of course, $output now contains the entire source. How do I get just the contents of that element?
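For reference, here is a rough sketch of a DOM-based alternative to a regex, using PHP's bundled DOMDocument and DOMXPath classes (this assumes the dom extension is enabled). $html is a hypothetical stand-in for $output, and the XPath expression is just one way to match a div carrying both classes.

```php
<?php
// Hypothetical stand-in for $output; in practice this would be the cURL result.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="blog_post_item first">
<p>Hello</p>
</div><!-- end blog_post_item -->
</body>
</html>
HTML;

// Collect libxml parse warnings internally instead of emitting them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match a div whose class attribute contains both "blog_post_item" and "first"
// as whole, space-separated tokens.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " blog_post_item ")'
       . ' and contains(concat(" ", normalize-space(@class), " "), " first ")]';
$node = $xpath->query($query)->item(0);

// saveHTML() with a node argument serializes only that node and its children.
$fragment = $node !== null ? $doc->saveHTML($node) : null;
```

Unlike the regex idea, this does not depend on the trailing comment or on exact whitespace, at the cost of parsing the whole page first.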