Using Simple HTML DOM parser to get the content of a wiki infobox

Question

I am trying to display the content of a Wikipedia infobox using the Simple HTML DOM parser, but it gives me a problem. This is the code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<?php
//The folder where you uploaded simple_html_dom.php
require_once('simple_html_dom.php');

//Wikipedia page to parse
$html = file_get_html('https://en.wikipedia.org/wiki/Burger_King');

foreach ( $html->find ( 'table[class=infobox vcard]' ) as $element ) {

    $cells = $element->find('td');

    $i = 0;

    foreach($cells as $cell) {


        $left[$i] = $cell->plaintext;

        if (!(empty($left[$i]))) {

            $i = $i + 1;

        }

    }


    $cells = $element->find('th');

    $i = 0;

    foreach($cells as $cell) {

        $right[$i] = $cell->plaintext;

        if (!(empty($right[$i]))) {

            $i = $i + 1;

        }

    }


print_r ($right);

echo "<br><br><br>";

print_r ($left);

//If you want to know what kind of industry burger king is
//echo "Burger King is $right[2], $left[2]";

}


?>

</body>
</html>

The code does not work on other pages such as https://en.wikipedia.org/wiki/United_Kingdom; it works only with https://en.wikipedia.org/wiki/Burger_King. This is the error message I am getting: Fatal error: Call to a member function find() on a non-object in C:\wamp\www\MyApps\Inbox.php on line 16

No! but I have now found how to enable it by using http://stackoverflow.com/questions/2305954/how-to-enable-https-stream-wrappers. — user3264002, Feb 10 '14 at 23:12
Thanks. It works, but only with https://en.wikipedia.org/wiki/Burger_King. Do you know why it is not working with other pages like https://en.wikipedia.org/wiki/India? — user3264002, Feb 10 '14 at 23:17

Giacomo Pigani · Answer 1 · 2014-02-11T13:53:19.983

1: This code doesn't work for you because you are trying to fetch the table with class="infobox vcard", which is used on company pages, while on a country page the table has class="infobox geography vcard".
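
One way around the class mismatch in point 1 is to match on the single "infobox" class token instead of the full attribute string. The sketch below uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, only so that it runs without extra files. The function name find_infoboxes and the sample markup are made up for illustration. Simple HTML DOM's class selector (find('table.infobox')) may behave similarly, but that is not verified here.

```php
<?php
// Hypothetical helper: return every <table> whose class list contains "infobox",
// whatever other classes (vcard, geography, ...) come with it.
function find_infoboxes(string $markup): array
{
    $doc = new DOMDocument();
    // Real pages are rarely clean markup, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad the class attribute with spaces so "infobox" matches only as a whole token.
    $query = "//table[contains(concat(' ', normalize-space(@class), ' '), ' infobox ')]";

    return iterator_to_array($xpath->query($query));
}

// Made-up sample markup standing in for a company page and a country page.
$company = '<table class="infobox vcard"><tr><th>Industry</th><td>Restaurants</td></tr></table>';
$country = '<table class="infobox geography vcard"><tr><th>Capital</th><td>London</td></tr></table>';

echo count(find_infoboxes($company)), "\n";
echo count(find_infoboxes($country)), "\n";
```

Both sample tables are found by the same query, so the selector no longer needs to change between company and country pages.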

2: That isn't the only problem, though: you are surely running out of memory as well.

Substitute

$html = file_get_html('https://en.wikipedia.org/wiki/United_Kingdom');

with:

$url = 'https://en.wikipedia.org/wiki/United_Kingdom';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);

And you should get something like

Fatal error: Out of memory (allocated XXX) (tried to allocate 40 bytes) 
in /simple_html_dom.php on line 1544
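
Separately from the memory error, the cURL fetch itself can fail, and then the parser is handed false instead of HTML. A minimal sketch of the same fetch with failure checks follows. The function name fetch_page, the timeout value and the user agent string are assumptions for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the body, or null on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_USERAGENT, 'infobox-example/0.1'); // made-up user agent

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // curl_exec() returns false on transport errors; non-200 replies are also treated as failures.
    if ($body === false || $status !== 200) {
        return null;
    }

    return $body;
}

$page = fetch_page('https://en.wikipedia.org/wiki/United_Kingdom');
if ($page === null) {
    echo "Could not fetch the page\n";
} else {
    echo strlen($page), " bytes fetched\n";
}
```

Only when $page is not null would it be passed on to $html->load(...), which avoids calling find() on something that is not an object.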

3: Even if you manage to fix the previous problems, you will also have to update your code, which probably won't work as it is.
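
As for what such an update could look like: the original loops collect the th and td cells into two separate arrays, so the indexes drift apart as soon as a row has only one of the two. A sketch of pairing them row by row instead, again with PHP's built-in DOM classes rather than Simple HTML DOM. The function name infobox_to_array and the sample rows are hypothetical.

```php
<?php
// Hypothetical helper: map each infobox label (<th>) to its value (<td>), row by row.
function infobox_to_array(string $markup): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];

    // Only rows that hold both a header cell and a data cell become a pair.
    foreach ($xpath->query('//tr[th and td]') as $row) {
        $label = trim($xpath->query('th', $row)->item(0)->textContent);
        $value = trim($xpath->query('td', $row)->item(0)->textContent);
        if ($label !== '') {
            $pairs[$label] = $value;
        }
    }

    return $pairs;
}

// Made-up sample rows; a real infobox has many more and messier cells.
$sample = '<table class="infobox vcard">'
        . '<tr><th colspan="2">Burger King</th></tr>'
        . '<tr><th>Industry</th><td>Restaurants</td></tr>'
        . '<tr><th>Founded</th><td>1953</td></tr>'
        . '</table>';

print_r(infobox_to_array($sample));
```

The title row has no td, so it is skipped, and a value can then be looked up by its label (for example $pairs['Industry']) instead of by a numeric position.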

Edit 1:

My favourite way to avoid this problem is to use Google cache, which has a "text only" version. This usually avoids the need to store a huge amount of data, which is one of the things keeping your code from working. The major downside, however, is that Google cache doesn't know what to do with <th>, so what was inside just disappears.

I'll look for an alternative; meanwhile, here's the code XD

<?php

require_once('simple_html_dom.php');
//$html = file_get_html('https://en.wikipedia.org/wiki/United_Kingdom');

//q = website to fetch, leave "cache:"
$url = 'http://webcache.googleusercontent.com/search?strip=1&q=cache:en.wikipedia.org/wiki/United_Kingdom';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);

//echo $html;


foreach ( $html->find ( 'table[class=infobox geography vcard]' ) as $element ) {


    $cells = $element->find('td');

    $i = 0;

    foreach($cells as $cell) {


        $left[$i] = $cell->plaintext;

        if (!(empty($left[$i]))) {

            $i = $i + 1;

        }

    }


print_r ($left);

}


?>

If I helped you (and I'm sure I did), mark as best answer and thumbs up :P

user3264002 · Answer 2 · score 0 · answered Feb 11 '14 at 02:55

I have found that the error comes from table[class=infobox vcard]; this only retrieves the content of a table whose class is exactly that infobox class.
