
I am using simple_html_dom.php from http://simplehtmldom.sourceforge.net to obtain the complete URLs of all images on a Wikipedia page. I'm searching mostly for companies and organisations. The script below works for a few, but for many searches (for example YouTube, and also Facebook among others) I get `Fatal error: Call to a member function find() on a non-object...`. I am aware that this is because `$html` is not an object. What method is going to have the most success in returning the URLs? Please see the code below. Any help is greatly appreciated.

<html>
<body>
<h2>Search</h2>
<form method="post">
Search: <input type="text" name="q" value="YouTube"/>
<input type="submit" value="Submit">
</form>

<?php

include 'simple_html_dom.php'; 

if (isset($_POST['q'])) 
    {
    $search = $_POST['q'];
    $search = ucwords($search);
    $search = str_replace(' ', '_', $search);  
    $html = file_get_html("http://en.wikipedia.org/wiki/$search");

    ?>
    <h2>Search results for '<?php echo $search; ?>'</h2>
    <ol>
        <?php

        foreach ($html->find('img') as $element): ?>

        <?php $photo = $element->src;

        echo $photo;

        ?>              

        <?php endforeach; 
    ?>
    </ol>
<?php 
}
?>
</body>
</html>
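For what it's worth, `file_get_html()` returns `false` rather than an object when the underlying fetch or parse fails (Wikipedia may reject requests that arrive without a recognisable user agent), and calling `find()` on that `false` is exactly what produces the fatal error above. A minimal guard, sketched against the code in the question:

```php
<?php
// Sketch only: bail out when file_get_html() fails instead of hitting the
// "Call to a member function find() on a non-object" fatal.
include 'simple_html_dom.php';

$search = 'YouTube';
$html = file_get_html("http://en.wikipedia.org/wiki/$search");

if ($html === false) {
    // The fetch or parse failed -- $html is not an object here
    die("Could not load http://en.wikipedia.org/wiki/$search");
}

foreach ($html->find('img') as $element) {
    echo $element->src, "\n";
}
```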

I have now followed the advice in the comments below (though I'm probably making a mistake) and encounter errors when I click Submit, along the lines of:

Warning: DOMDocument::loadHTMLFile(): ID ref_media_type_table_note_2 already defined in http://en.wikipedia.org/wiki/YouTube, line: 270 in...

Warning: DOMDocument::loadHTMLFile(): ID ref_media_type_table_note_2 already defined in http://en.wikipedia.org/wiki/YouTube, line: 501 in...

Please see my amended code below:

<html> 
<body> 
    <form method="post"> Search: 
        <input type="text" name="q" value="YouTube"/> 
        <input type="submit" value="Submit"> </form> 
            <?php 
            if (isset($_POST['q'])) 
                { $search = $_POST['q'];
                  $search = ucwords($search); 
                  $search = str_replace(' ', '_', $search); 
                  $doc = new DOMDocument(); 
                  $doc->loadHTMLFile("http://en.wikipedia.org/wiki/$search"); 

                  foreach ($doc->getElementsByTagName('img') as $image) 
                     echo $image->getAttribute('src'); 

                } 
                ?>
</body> 
</html>
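Those duplicate-ID warnings come from libxml, which `DOMDocument` uses internally; besides the `@` operator suggested in the comments, `libxml_use_internal_errors(true)` collects them quietly instead of printing them. A sketch of the loading portion with that change (the page name is hard-coded here for illustration):

```php
<?php
// Sketch: collect libxml's HTML-parsing complaints (like Wikipedia's
// duplicate-ID notices) instead of letting them print as warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTMLFile("http://en.wikipedia.org/wiki/YouTube");

foreach ($doc->getElementsByTagName('img') as $image) {
    echo $image->getAttribute('src'), "\n";
}

// Optionally inspect the suppressed errors with libxml_get_errors(),
// then clear the buffer.
libxml_clear_errors();
```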
Oroku
    I'd steer well clear of SimpleHTMLDom and stick to the very mature and well maintained built-in DOM extension. `$doc = new DOMDocument(); $doc->loadHTMLFile("http://en.wikipedia.org/wiki/$search")` should get you going – Phil Dec 04 '14 at 00:04
  • @phil thanks very much and sorry for the ignorance but how would I implement that to get the url images? – Oroku Dec 04 '14 at 00:35
  • `foreach ($doc->getElementsByTagName('img') as $image) echo $image->getAttribute('src');` – Phil Dec 04 '14 at 00:43
  • @phil sorry I'm definitely making a mistake as I keep getting errors with the following code: `Search: loadHTMLFile("http://en.wikipedia.org/wiki/$search"); foreach ($doc->getElementsByTagName('img') as $image) echo $image->getAttribute('src'); } ?>` – Oroku Dec 04 '14 at 00:54
  • Don't put great chunks of code in the comments, edit your question. What errors are you getting? – Phil Dec 04 '14 at 02:16
  • @phil please see amended code added now – Oroku Dec 04 '14 at 10:05
  • Did you think of using [their api](http://stackoverflow.com/a/627606/1519058) or it does not meet your needs ? – Enissay Dec 04 '14 at 13:07
  • @Enissay yeah I tried it but wasn't able to get the data I wanted. Basically I'm trying to get the urls of all images on any particular Wikipedia page. There may be a way to get it using the API but I can't see how. Please let me know if there is – Oroku Dec 04 '14 at 13:39
  • @Oroku those warnings are just about Wiki's crappy HTML. There might be a way to suppress them via options but you can always use `@$doc->loadHTMLFile($url)` – Phil Dec 04 '14 at 21:56
  • Found this answer which might help you ~ http://stackoverflow.com/a/2847141/283366 – Phil Dec 04 '14 at 22:04
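For completeness, a rough sketch of the API route mentioned above: `action=query&prop=images` lists the file titles used on a page, and a follow-up query with `prop=imageinfo&iiprop=url` can resolve each `File:` title to a full URL. The parameter names are from the MediaWiki query API, but check the current docs; long pages may also need continuation parameters.

```php
<?php
// Sketch only: list the image file titles on a Wikipedia page via the API.
// Resolving titles to full URLs needs a second query with
// prop=imageinfo&iiprop=url for each File: title.
$page = 'YouTube';
$url = 'http://en.wikipedia.org/w/api.php?action=query&format=json'
     . '&prop=images&imlimit=500&titles=' . urlencode($page);

$json = json_decode(file_get_contents($url), true);

foreach ($json['query']['pages'] as $p) {
    if (!isset($p['images'])) {
        continue; // missing page or no images
    }
    foreach ($p['images'] as $img) {
        echo $img['title'], "\n"; // titles look like "File:Something.svg"
    }
}
```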

1 Answer

  • Those warnings are safe to ignore.
  • You can suppress them with an @ in front of the function.
  • The file_get_html issues can probably be resolved by switching to curl.
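To flesh out the curl suggestion a little, here is a sketch, not a drop-in: the user-agent string is made up, and Wikipedia's markup still triggers libxml warnings, hence the `@`.

```php
<?php
// Fetch the page with curl (so a User-Agent can be set), then parse the
// returned string with DOMDocument instead of loadHTMLFile().
$search = 'YouTube';
$ch = curl_init("http://en.wikipedia.org/wiki/$search");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'ImageLister/1.0'); // hypothetical UA string
$body = curl_exec($ch);
curl_close($ch);

if ($body === false) {
    die('curl request failed');
}

$doc = new DOMDocument();
@$doc->loadHTML($body); // @ silences the duplicate-ID warnings

foreach ($doc->getElementsByTagName('img') as $image) {
    echo $image->getAttribute('src'), "\n";
}
```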
pguardiario