I wrote a regex that extracts an image URL from a page's HTML source.

<?php
function get_logo($html, $url)
{
    // First attempt: absolute http(s) URLs that end in png or jpg.
    if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
        echo "First";
        return $matches[0][0];
    }

    // Fallback: any non-whitespace run ending in png or jpg, resolved against the page URL.
    if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches)) {
        echo "Second";
        return url_to_absolute($url, $matches[0][0]);
    }

    return null;
}

But on a Wikipedia page, the image URL looks like this:

http://en.wikipedia.org/wiki/File:Nelson_Mandela-2008_(edit).jpg

My regex always fails on it. How can I fix this?

user123
  • [Don't use a regular expression to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)! Instead use XPath, which would yield your result with ease. – nietonfir Dec 13 '13 at 13:59
  • You should ***not*** use regular expressions to parse HTML or XML. Instead, you should use a tool like [Simple HTML DOM](http://simplehtmldom.sourceforge.net/) that provides the proper capability to parse these types of files. – War10ck Dec 13 '13 at 14:24
  • +1 for each of the suggestions above. DOM parsing is often easier to implement, read, understand and maintain. Also, as with any website content scraping, be sure to check the target website's terms of use to ensure you're not violating them. – daiscog Dec 13 '13 at 14:33
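
For anyone wondering what the XPath suggestion might look like in practice, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The sample markup and the `upload.example.org` host are made up for illustration; in real use `$html` would be the fetched page source.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body>'
      . '<a href="/wiki/File:Nelson_Mandela-2008_(edit).jpg">'
      . '<img src="//upload.example.org/Nelson_Mandela-2008_(edit).jpg" alt=""></a>'
      . '<img src="/static/logo.png" alt="logo">'
      . '</body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the src attribute node of every <img> element in the document.
$xpath = new DOMXPath($doc);
$sources = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

Because the parser reads the attribute value as a whole, characters such as parentheses in the file name need no special handling. Relative or protocol-relative values would still have to be resolved against the page URL afterwards.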

1 Answer

Why try to parse HTML with a regex when this can easily be done with PHP's DOMDocument class?

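<?php
$doc = new DOMDocument();
// The @ suppresses the warnings libxml emits for malformed real-world HTML.
@$doc->loadHTMLFile( "http://www.wikipedia.org/" );

// Collect every <img> element in the document.
$images = $doc->getElementsByTagName("img");

// Print the src attribute of each image, one per line.
foreach( $images as $image ) {
    echo $image->getAttribute("src");
    echo "<br>";
}

?>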
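
If you already have the page source in a string, as `get_logo($html, $url)` in the question does, the same idea could be adapted roughly as follows. This is only a sketch: the function name `first_image_src()` and the sample markup are hypothetical, and it returns the raw `src` value, so a relative result would still need resolving against the page URL (for example with the `url_to_absolute()` helper the question already uses).

```php
<?php
/**
 * Return the src of the first <img> whose path ends in .png, .jpg or .jpeg,
 * or null if there is none. The function name is hypothetical.
 */
function first_image_src(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        // Check only the path part so a query string such as ?v=2
        // does not hide the file extension.
        $path = (string) parse_url($src, PHP_URL_PATH);
        if (preg_match('/\.(?:png|jpe?g)$/i', $path)) {
            return $src;
        }
    }

    return null;
}

// Made-up markup: a tracking pixel followed by the image we want.
$html = '<p><img src="/pixel.gif">'
      . '<img src="/img/Nelson_Mandela-2008_(edit).jpg?v=2"></p>';
echo first_image_src($html), "\n";
```

The extension check runs on a single attribute value rather than on the whole page, so the pattern can stay short and anchored at the end of the path.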
AeroX