How to 'scrape' content from a page's source?

Question

I have this code which gets the HTML source of a page:

$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to scrape some content from it. For example, say the page's source contains this:

<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way I could scrape this from the source and store it in a variable, so it'll look like this:

technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.

Of course, the page is dynamic, which is why I'm having a problem. Could I maybe search for each site in the source? But then how would I get the result that comes after it (Connection failed / Done)?
Thanks a lot for the help!

You say the page is dynamic and that's a problem, but I see a clear schema in the page: "siteURI" followed by "connection result". Does this change sometimes? — CaNNaDaRk, Sep 06 '11 at 14:26
Don't say "for example, say the page's source contains this", it either does or it doesn't! If it doesn't then any code provided to parse that particular HTML will be of no use! — fire, Sep 06 '11 at 14:27
Every time, the source will contain each of those sites. I want to scrape each site and the result that comes after it. I used the above as an example, but sites that give the result "Connection failed" today may give the result "Done" tomorrow. Hope that makes sense. — Joey Morani, Sep 06 '11 at 14:31
you should use [regular expressions](http://www.php.net/manual/en/ref.pcre.php) — Maria, Sep 06 '11 at 14:29

score 15 · Accepted Answer · answered Sep 06 '11 at 14:27

I have scraped several sites with the Simple HTML DOM PHP library, which can be obtained here: http://simplehtmldom.sourceforge.net/

Then use code like this:

<?php
include_once 'simple_html_dom.php';

$url = "http://slashdot.org/";
$html = file_get_html($url);

//remove additional spaces
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";

foreach ($html->find('h2') as $heading) { // for each heading
    // take the first link inside a span, if there is one, and echo its cleaned-up text
    $link = $heading->find('span a', 0);
    if ($link) {
        echo preg_replace($pat, $rep, $link->plaintext) . "\n";
    }
}
?>

This results in something like:

5.8 Earthquake Hits East Coast of the US
Origins of Lager Found In Argentina
Inside Oregon State University's Open Source Lab
WebAPI: Mozilla Proposes Open App Interface For Smartphones
Using Tablets Becoming Popular Bathroom Activity
The Syrian Government's Internet Strategy
Deus Ex: Human Revolution Released
Taken Over By Aliens? Google Has It Covered
The GIMP Now Has a Working Single-Window Mode
Zombie Cookies Just Won't Die
Motorola's Most Important 18 Patents
MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
Evangelical Scientists Debate Creation Story
Android On HP TouchPad
Google Street View Gets Israeli Government's Nod
Internet Restored In Tripoli As Rebels Take Control
GA Tech: Internet's Mid-Layers Vulnerable To Attack
Serious Crypto Bug Found In PHP 5.3.7
Twitter To Meet With UK Government About Riots
EU Central Court Could Validate Software Patents

This is very simple, thanks for the code. I just wrote a 10-line names scraper in 10 minutes, with a check of the headers for a 200 OK response to make it more automated. — , Jun 14 '13 at 04:26
Do you think you could help me? I have used your code but cannot seem to do what I need. I need to display the HREF data. — JamesG, Jan 04 '17 at 01:44

sanmai · Answer 2 · 2011-09-06T14:40:47.943

0

This isn't the best solution, but it works:

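$page = file_get_contents('http://example.com/page.html');
// Capture the site name inside <strong> and the status text after the following <br />.
preg_match_all('#<strong>([^<]+)</strong><br />\s*([^<]+)<#', $page, $result, PREG_SET_ORDER);
foreach ($result as $row) {
    // $row[1] is the site, $row[2] is the result ("Done" or "Connection failed").
    echo "<p><b>$row[1]</b> $row[2]</p>\n";
}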

If you need to scrape something more complex, consider DOMDocument.
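
For comparison, here is a rough sketch of what the DOMDocument route might look like for the markup in the question. It is one possible approach, not a definitive one: the function name `parse_ping_results` is hypothetical, the sample string is a shortened copy of the question's markup, and the code assumes each status is the first non-empty text node after a `<strong>` element.

```php
<?php
// Shortened sample copied from the question. A real script would fetch the
// source with file_get_contents() and must not run htmlentities() on it
// before parsing, since that would escape the tags.
$page = <<<HTML
<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br />
HTML;

// Hypothetical helper: returns an array of site => status.
function parse_ping_results(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $results = [];
    foreach ($doc->getElementsByTagName('strong') as $strong) {
        // Walk forward to the first non-empty text node after <strong>.
        for ($node = $strong->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType === XML_TEXT_NODE && trim($node->nodeValue) !== '') {
                $results[trim($strong->textContent)] = trim($node->nodeValue);
                break;
            }
        }
    }
    return $results;
}

foreach (parse_ping_results($page) as $site => $status) {
    echo "$site $status\n";
}
```

Because the parser builds a tree instead of matching text, this keeps working if attributes or extra whitespace appear inside the tags, at the cost of a few more lines than the regex.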

edited Sep 06 '11 at 14:40

answered Sep 06 '11 at 14:30


moteutsch · Answer 3 · 2011-09-06T14:37:39.800

-3

You can use Regular Expressions.

Edit

Regex isn't the best solution for large problems, but for simple pages with a standard format, regex is often simplest to use.
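
As an illustration of the simple, fixed-format case, the sketch below stores the "site result" lines from the question in a variable. The inline sample and the helper name `ping_lines` are assumptions made for this example; the pattern relies on every status following a `<strong>` element and a `<br />`.

```php
<?php
// Inline sample in the question's format. Real code would use the fetched
// page source, without passing it through htmlentities() first.
$source = "Pinging <strong>blo.gs</strong><br />\nDone<br /><br />"
        . "Pinging <strong>my.yahoo.com</strong><br />\nConnection failed<br />";

// Hypothetical helper: builds one "site result" line per match.
function ping_lines(string $source): string
{
    preg_match_all(
        '#<strong>([^<]+)</strong>\s*<br\s*/?>\s*([^<]+?)\s*<#i',
        $source,
        $matches,
        PREG_SET_ORDER
    );

    $lines = [];
    foreach ($matches as $m) {
        // $m[1] is the site name, $m[2] is the status text.
        $lines[] = "$m[1] $m[2]";
    }
    return implode("\n", $lines);
}

$summary = ping_lines($source);
echo $summary, "\n";
```

If the page's markup ever changes, for example by wrapping the status in another tag, the pattern would need to be adjusted by hand.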

edited Sep 06 '11 at 14:37

answered Sep 06 '11 at 14:27


You are right that regexp can be viable for small, simple pages. But HTML is not always well formed, and you cannot count on people closing tags correctly and so on, so it can be quite hard to write a regexp that covers all cases on big, messy pages. – Cheesebaron Sep 06 '11 at 20:58
That is a matter that may be debated at length. Personally, I have found that regex is often the simplest solution. You may disagree. Nevertheless, it certainly isn't so clear cut that it deserves a down-vote. – moteutsch Sep 07 '11 at 20:21
