Scrape image URL from Wikipedia page

Question

I created a regex that extracts an image URL from the source code of a page.

<?php
function get_logo($html, $url)
{
    // First attempt: an absolute http(s) URL ending in png or jpg.
    if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
        echo "First";
        return $matches[0][0];
    } else {
        // Second attempt: any path ending in png or jpg, with an optional scheme.
        if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches)) {
            echo "Second";
            // url_to_absolute() is a helper defined elsewhere (not shown here).
            return url_to_absolute($url, $matches[0][0]);
        } else {
            return null;
        }
    }
}

But on a Wikipedia page, the image URL looks like this:

http://en.wikipedia.org/wiki/File:Nelson_Mandela-2008_(edit).jpg

and my regex always fails on it.

How can I fix this?

[Don't use a regular expression to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)! Instead use XPath, which would yield your result with ease. — nietonfir, Dec 13 '13 at 13:59
You should ***not*** use regular expressions to parse HTML or XML. Instead, you should use a tool like [Simple HTML DOM](http://simplehtmldom.sourceforge.net/) that provides the proper capability to parse these types of files. — War10ck, Dec 13 '13 at 14:24
+1 for each of the suggestions above. DOM parsing is often easier to implement, read, understand and maintain. Also, as with any website content scraping, be sure to check the target website's terms of use to ensure you're not violating them. — daiscog, Dec 13 '13 at 14:33

AeroX · Accepted Answer · 2013-12-16T10:44:24.940

4

Why try to parse HTML with regex when this can easily be done with PHP's DOMDocument class?

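<?php
$doc = new DOMDocument();
// The @ suppresses the warnings libxml raises for malformed real-world HTML.
@$doc->loadHTMLFile("http://www.wikipedia.org/");

// Collect every <img> element in the document.
$images = $doc->getElementsByTagName("img");

// Print the src attribute of each image, one per line.
foreach ($images as $image) {
    echo $image->getAttribute("src");
    echo "<br>";
}

?>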
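
If only one image URL is wanted, the same DOM approach can be narrowed with XPath. The following is a minimal sketch, not part of the original answer: the function name `first_image_src` is hypothetical, it works on an HTML string rather than fetching a page, and it assumes that a `src` beginning with `//` (a protocol-relative URL, which Wikipedia pages commonly use) should be given an `https:` scheme.

```php
<?php
// Hypothetical helper: returns the src of the first <img> in an HTML string,
// or null when the string is empty or contains no <img> with a src attribute.
function first_image_src(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    // (//img[@src])[1] selects the first <img> in document order that has a src.
    $nodes = $xpath->query('(//img[@src])[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $src = $nodes->item(0)->getAttribute('src');

    // Assumption: treat protocol-relative URLs as https.
    if (strpos($src, '//') === 0) {
        $src = 'https:' . $src;
    }
    return $src;
}
```

Because no pattern has to anticipate which characters may appear in a URL, parentheses in a file name such as `Nelson_Mandela-2008_(edit).jpg` need no special handling. A root-relative or relative `src` is returned unchanged and would still need to be resolved against the page URL, for example with the `url_to_absolute()` helper used in the question.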

edited Dec 16 '13 at 10:44

answered Dec 13 '13 at 14:33

AeroX


Why don't you use `$image->getAttribute('src')` instead of using a foreach loop for all attributes? – Casimir et Hippolyte Dec 13 '13 at 14:49
@AeroX: Can you please modify it to get only one image URL? – user2129623 Dec 13 '13 at 15:25
@Programming_crazy: The above code gives you the content of the src attribute inside img tags. It is only an example of the first step, not a gift-wrapped solution. – Casimir et Hippolyte Dec 13 '13 at 16:16
@CasimiretHippolyte Thanks for the tip on `$image->getAttribute('src')`. Edited answer to use that method. – AeroX Dec 16 '13 at 10:38
