
I am using simple_html_dom.php from http://simplehtmldom.sourceforge.net to obtain the complete URLs of all images on a Wikipedia page. I'm searching mostly for companies and organisations. The script below works for a few, but for many searches (for example YouTube, and also Facebook among others) I get `Fatal error: Call to a member function find() on a non-object...`. I am aware that this is because `$html` is not an object. What method is going to have the most success in returning the URLs? Please see the code below. Any help is greatly appreciated.

<html>
<body>
<h2>Search</h2>
<form method="post">
Search: <input type="text" name="q" value="YouTube"/>
<input type="submit" value="Submit">
</form>

<?php

include 'simple_html_dom.php'; 

if (isset($_POST['q'])) 
    {
    $search = $_POST['q'];
    $search = ucwords($search);
    $search = str_replace(' ', '_', $search);  
    $html = file_get_html("http://en.wikipedia.org/wiki/$search");

    ?>
    <h2>Search results for '<?php echo $search; ?>'</h2>
    <ol>
        <?php

        foreach ($html->find('img') as $element): ?>

        <?php $photo = $element->src;

        echo $photo;

        ?>              

        <?php endforeach; 
    ?>
    </ol>
<?php 
}
?>
</body>
</html>
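For what it's worth, `file_get_html()` returns `false` rather than an object when the underlying fetch or parse fails (Wikipedia may reject requests that arrive without a recognisable user agent), and calling `find()` on that `false` is exactly what produces the fatal error above. A minimal guard, sketched against the code in the question:

```php
<?php
// Sketch only: bail out when file_get_html() fails instead of hitting the
// "Call to a member function find() on a non-object" fatal.
include 'simple_html_dom.php';

$search = 'YouTube';
$html = file_get_html("http://en.wikipedia.org/wiki/$search");

if ($html === false) {
    // The fetch or parse failed -- $html is not an object here
    die("Could not load http://en.wikipedia.org/wiki/$search");
}

foreach ($html->find('img') as $element) {
    echo $element->src, "\n";
}
```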

I have now followed the advice in the comments below (though I'm probably making a mistake) and encounter errors when I click Submit, along the lines of:

Warning: DOMDocument::loadHTMLFile(): ID ref_media_type_table_note_2 already defined in http://en.wikipedia.org/wiki/YouTube, line: 270 in...

Warning: DOMDocument::loadHTMLFile(): ID ref_media_type_table_note_2 already defined in http://en.wikipedia.org/wiki/YouTube, line: 501 in...

Please see my amended code below:

<html> 
<body> 
    <form method="post"> Search: 
        <input type="text" name="q" value="YouTube"/> 
        <input type="submit" value="Submit"> </form> 
            <?php 
            if (isset($_POST['q'])) 
                { $search = $_POST['q'];
                  $search = ucwords($search); 
                  $search = str_replace(' ', '_', $search); 
                  $doc = new DOMDocument(); 
                  $doc->loadHTMLFile("http://en.wikipedia.org/wiki/$search"); 

                  foreach ($doc->getElementsByTagName('img') as $image) 
                     echo $image->getAttribute('src'); 

                } 
                ?>
</body> 
</html>
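Those duplicate-ID warnings come from libxml, which `DOMDocument` uses internally; besides the `@` operator suggested in the comments, `libxml_use_internal_errors(true)` collects them quietly instead of printing them. A sketch of the loading portion with that change (the page name is hard-coded here for illustration):

```php
<?php
// Sketch: collect libxml's HTML-parsing complaints (like Wikipedia's
// duplicate-ID notices) instead of letting them print as warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTMLFile("http://en.wikipedia.org/wiki/YouTube");

foreach ($doc->getElementsByTagName('img') as $image) {
    echo $image->getAttribute('src'), "\n";
}

// Optionally inspect the suppressed errors with libxml_get_errors(),
// then clear the buffer.
libxml_clear_errors();
```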
Oroku
    I'd steer well clear of SimpleHTMLDom and stick to the very mature and well maintained built-in DOM extension. `$doc = new DOMDocument(); $doc->loadHTMLFile("http://en.wikipedia.org/wiki/$search")` should get you going – Phil Dec 04 '14 at 00:04
  • @phil thanks very much and sorry for the ignorance but how would I implement that to get the url images? – Oroku Dec 04 '14 at 00:35
  • `foreach ($doc->getElementsByTagName('img') as $image) echo $image->getAttribute('src');` – Phil Dec 04 '14 at 00:43
  • @phil sorry I'm definitely making a mistake as I keep getting errors with the following code: `Search: loadHTMLFile("http://en.wikipedia.org/wiki/$search"); foreach ($doc->getElementsByTagName('img') as $image) echo $image->getAttribute('src'); } ?>` – Oroku Dec 04 '14 at 00:54
  • Don't put great chunks of code in the comments, edit your question. What errors are you getting? – Phil Dec 04 '14 at 02:16
  • @phil please see amended code added now – Oroku Dec 04 '14 at 10:05
  • Did you think of using [their api](http://stackoverflow.com/a/627606/1519058) or it does not meet your needs ? – Enissay Dec 04 '14 at 13:07
  • @Enissay yeah I tried it but wasn't able to get the data I wanted. Basically I'm trying to get the urls of all images on any particular Wikipedia page. There may be a way to get it using the API but I can't see how. Please let me know if there is – Oroku Dec 04 '14 at 13:39
  • @Oroku those warnings are just about Wiki's crappy HTML. There might be a way to suppress them via options but you can always use `@$doc->loadHTMLFile($url)` – Phil Dec 04 '14 at 21:56
  • Found this answer which might help you ~ http://stackoverflow.com/a/2847141/283366 – Phil Dec 04 '14 at 22:04
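For completeness, a rough sketch of the API route mentioned above: `action=query&prop=images` lists the file titles used on a page, and a follow-up query with `prop=imageinfo&iiprop=url` can resolve each `File:` title to a full URL. The parameter names are from the MediaWiki query API, but check the current docs; long pages may also need continuation parameters.

```php
<?php
// Sketch only: list the image file titles on a Wikipedia page via the API.
// Resolving titles to full URLs needs a second query with
// prop=imageinfo&iiprop=url for each File: title.
$page = 'YouTube';
$url = 'http://en.wikipedia.org/w/api.php?action=query&format=json'
     . '&prop=images&imlimit=500&titles=' . urlencode($page);

$json = json_decode(file_get_contents($url), true);

foreach ($json['query']['pages'] as $p) {
    if (!isset($p['images'])) {
        continue; // missing page or no images
    }
    foreach ($p['images'] as $img) {
        echo $img['title'], "\n"; // titles look like "File:Something.svg"
    }
}
```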

1 Answer

  • Those warnings are safe to ignore.
  • You can suppress them with an @ in front of the function.
  • The file_get_html issues can probably be resolved by switching to curl.
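To flesh out the curl suggestion a little, here is a sketch, not a drop-in: the user-agent string is made up, and Wikipedia's markup still triggers libxml warnings, hence the `@`.

```php
<?php
// Fetch the page with curl (so a User-Agent can be set), then parse the
// returned string with DOMDocument instead of loadHTMLFile().
$search = 'YouTube';
$ch = curl_init("http://en.wikipedia.org/wiki/$search");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'ImageLister/1.0'); // hypothetical UA string
$body = curl_exec($ch);
curl_close($ch);

if ($body === false) {
    die('curl request failed');
}

$doc = new DOMDocument();
@$doc->loadHTML($body); // @ silences the duplicate-ID warnings

foreach ($doc->getElementsByTagName('img') as $image) {
    echo $image->getAttribute('src'), "\n";
}
```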
pguardiario