
Consider a document in the following format:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
   <div class="blog_post_item first">
       <?php // some child elements ?>
   </div><!-- end blog_post_item -->
</body>
</html>

I am loading a document like this from another domain with PHP cURL. I would like to trim the cURL result down to just div.blog_post_item.first and its children. I know the structure of the other page, but I can't edit it. I imagine I can use preg_match to find the opening and closing tags; they will always look the same, including that ending comment.

I have searched for examples/tutorials of screen scraping with cURL/XPath/XSLT/whatever, and it's mostly a cyclical rattling off of HTML parsing library names. For that reason, please provide a simple working example. Please don't just warn that parsing HTML with regex is a bad idea, and please don't just list libraries and specifications I should read further into.

I have some simple PHP cURL code:

$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
$output = curl_exec($ch);
curl_close($ch);

Of course, now $output contains the entire source. How will I get just the contents of that element?

Jezen Thomas

3 Answers


That's quite easy if you are sure the beginning and end are ALWAYS the same. All you have to do is search for the start and end markers and match everything between them. I think a lot of people will be annoyed at me for using regex to find a bit of HTML, but it'll do the job!

// cURL
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);

if(empty($output)) exit('Couldn\'t download the page');

// finding your data
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/';

preg_match_all($pattern, $output, $matches);
var_dump($matches); // all matches

Because I don't know which website you're trying to crawl I'm not sure if this works or not.


After searching for quite a while (26 minutes to be exact) I have found why it didn't work. The dot (.) doesn't match newlines by default. Because HTML is full of newlines, the pattern couldn't match the contents. Using a slightly dirty hack I managed to get it matching anyway (even though you already picked an answer).

// cURL
$ch = curl_init('http://blogg.oscarclothilde.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);

if(empty($output)) exit('Couldn\'t download the page');

// finding your data
$pattern = '/<div class="blog_post_item first">(([^.]|.)*?)<\/div><!-- end blog_post_item -->/';

preg_match_all($pattern, $output, $matches);
var_dump($matches[1][0]); // the first captured group: the div's inner HTML
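For reference, PCRE also has the `s` (DOTALL) modifier, which makes the dot match newlines and avoids the `([^.]|.)` workaround entirely. A minimal sketch with inline sample data (the `$output` string here is made up for illustration):

```php
<?php
// Sample multi-line HTML standing in for a real cURL response.
$output = "<div class=\"blog_post_item first\">\n  <p>Hello</p>\n</div><!-- end blog_post_item -->";

// Same pattern as above, but with the trailing `s` modifier so `.` matches newlines too.
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/s';

if (preg_match($pattern, $output, $matches)) {
    echo $matches[1]; // the inner HTML, newlines included
}
```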
Tim S.
  • It's a good start, but `var_dump($matches);` is giving me `array(2) { [0]=> array(0) { } [1]=> array(0) { } }` – Jezen Thomas Aug 03 '12 at 09:53
  • can I get an URL so I can give it a go myself? "a.web.page.com" clearly doesn't exist. Also, I added an extra check to see if you got a response in the first place. – Tim S. Aug 03 '12 at 09:54
  • I don't want to link to the site in public. I had a look at your site, but couldn't find contact information. You could email me and I can return the URL. – Jezen Thomas Aug 03 '12 at 09:58
  • @JezenThomas used the contact form on your website – Tim S. Aug 03 '12 at 10:00
  • +1 is the best I can do. Shame I can't mark two answers as correct! – Jezen Thomas Aug 03 '12 at 12:29
  • @JezenThomas no problem mate, I'm glad your problem is solved! – Tim S. Aug 03 '12 at 20:07
  • Sometimes XPath isn't the best option. If others code the page horribly then there may not be a reliable structure to parse; however, if you KNOW that particular ID will 'always' be there... – roberthuttinger Jan 17 '13 at 14:44

This piece of code should work (PHP >= 5.3.6, with the dom extension):

$s = <<<EOM
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
   <div class="blog_post_item first">
       <?php // some child elements ?>
   </div><!-- end blog_post_item -->
</body>
</html>
EOM;

$d = new DOMDocument;
$d->loadHTML($s);

$x = new DOMXPath($d);

foreach ($x->query('//div[contains(@class, "blog_post_item") and contains(@class, "first")]') as $el) {
    echo $d->saveHTML($el);
}
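Real-world markup is rarely as clean as the heredoc above and often triggers parser warnings from `loadHTML()`. A minimal sketch of the same XPath query with `libxml_use_internal_errors(true)` to suppress them (the `$html` sample is made up for illustration):

```php
<?php
// Sketch: parsing real-world (often invalid) HTML with DOMDocument.
// libxml_use_internal_errors(true) keeps malformed markup from
// flooding the output with parser warnings.
$html = '<div class="blog_post_item first"><p>Post</p></div>';

libxml_use_internal_errors(true);
$d = new DOMDocument;
$d->loadHTML($html);
libxml_clear_errors();

$x = new DOMXPath($d);
$result = '';
foreach ($x->query('//div[contains(@class, "blog_post_item") and contains(@class, "first")]') as $el) {
    $result .= $d->saveHTML($el); // serialize just the matched element
}
echo $result;
```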
Ja͢ck
  • ext DOM is the right way to go. Personally I'd use cURL to grab the target url output and cache it in one operation, then analyse it as @Jack explains in another. I.e. isolate the various latency problems associated with url grabbing (server down, timeouts, line problems etc) – Cups Aug 03 '12 at 10:19
  • Sadly, it threw me a bunch of warnings. – Jezen Thomas Aug 03 '12 at 10:24
  • @JezenThomas those warnings can be silenced by using `libxml_use_internal_errors(true)`. – Ja͢ck Aug 03 '12 at 13:07

If you are sure about the following structure:

<div class="blog_post_item first">
   WHATEVER
</div><!-- end blog_post_item -->

AND you are sure the ending string doesn't appear inside WHATEVER, then you can simply grab it.

(Note that I replaced your original PHP with WHATEVER. cURL only fetches the rendered HTML, so it will contain content, not PHP.)

You don't need a regex. You can also do it simply by searching for the wanted strings, like in my example below.

$curlResponse = '
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
   <div class="blog_post_item first">
       <?php // some child elements ?>
   </div><!-- end blog_post_item -->
</body>
</html>';

$startStr = '<div class="blog_post_item first">';
$endStr = '</div><!-- end blog_post_item -->';

$startStrPos = strpos($curlResponse, $startStr)+strlen($startStr);
$endStrPos = strpos($curlResponse, $endStr);

$wanted = substr($curlResponse, $startStrPos, $endStrPos - $startStrPos);

echo htmlentities($wanted);
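One caveat with this approach: `strpos()` returns `false` (not -1) when the needle is missing, and `false` silently coerces to 0 in the arithmetic above. A hedged sketch that guards both lookups (the `extractBetween` helper name is mine, not from the answer):

```php
<?php
// Sketch: the strpos/substr approach with guards, since strpos()
// returns false (not -1) when the needle is not found.
function extractBetween($haystack, $startStr, $endStr) {
    $start = strpos($haystack, $startStr);
    if ($start === false) {
        return null; // opening marker not found
    }
    $start += strlen($startStr);
    $end = strpos($haystack, $endStr, $start);
    if ($end === false) {
        return null; // closing marker not found
    }
    return substr($haystack, $start, $end - $start);
}

echo extractBetween(
    '<div class="blog_post_item first">inner</div><!-- end blog_post_item -->',
    '<div class="blog_post_item first">',
    '</div><!-- end blog_post_item -->'
); // prints "inner"
```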
Erwin Moller