I am collecting all images from a web page. However, some of the .png files are icons, and these are picked up as images too.

Is it possible to show only the real images on the page, and not icons or the favicon?

Here is my simple script:

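function get_logo($html, $url)
{
    $url = rtrim($url, '/');
    // Wikipedia pages: always return the known logo.
    if (strpos($url, 'wikipedia') !== false)
        return "http://upload.wikimedia.org/wikipedia/commons/5/53/Wikipedia-logo-en-big.png";
    // Otherwise collect every absolute .png/.jpg URL found in the markup.
    else if (preg_match_all('/\bhttps?:\/\/\S+\.(?:png|jpg)\b/', $html, $matches))
    {
        return $matches;
    }
    // Fall back to the first <img src="..."> and treat it as relative to $url.
    else
    {
        preg_match_all("/<img src=\"(.*?)\"/", $html, $matches);
        // Guard against pages without any matching <img> tag.
        return isset($matches[1][0]) ? $url . $matches[1][0] : null;
    }
}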

One of the results:

array (size=1)
  0 => 
    array (size=16)
      0 => string 'http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png' (length=63)
      1 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      2 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      3 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      4 => string 'https://i.stack.imgur.com/uE37r.png' (length=34)
      5 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      6 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      7 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      8 => string 'https://i.stack.imgur.com/dmHl0.png' (length=34)
      9 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      10 => string 'https://i.stack.imgur.com/dmHl0.png' (length=34)
      11 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
      12 => string 'https://i.stack.imgur.com/uE37r.png' (length=34)
      13 => string 'https://i.stack.imgur.com/NG6TX.png' (length=34)
      14 => string 'https://i.stack.imgur.com/BfCOt.png' (length=34)
      15 => string 'https://i.stack.imgur.com/tKsDb.png' (length=34)
user123
  • What exactly is your definition of a "real" image versus an icon? – Patrick Q Jun 09 '14 at 13:54
  • Image position is also important: if it is at the top, it is probably a logo; if it is at the bottom, it is probably some random image you don't need. Also try to ignore header, footer, sidebar, and ad elements. Then find the main block that has lots of text and take its first image; this is the image you want. – ViliusL Jun 09 '14 at 14:31
  • @PatrickQ: a real image is a general image that appears on the page as part of the content. Icons include the favicon, video icons (like http://i.stack.imgur.com/NG6TX.png, http://i.stack.imgur.com/tKsDb.png), rating images, etc. I have also given a sample list of such images. – user123 Jun 09 '14 at 15:29
  • Okay, but you have not provided a true _definition_ of what you want to include or exclude. Saying "I want to exclude icons" is not a definition. What are the specifications that make something an icon in your eyes? Will the definition hold true across all URLs that you will be pulling? Until you come up with some concrete rules defining what you want to include/exclude, the technology/functions/code used to do it is irrelevant. – Patrick Q Jun 09 '14 at 15:35

1 Answer

You could call getimagesize() on each image and set two limits, one for the width and one for the height. That gives you a way to decide whether an image is an icon (e.g. 64 x 64 px) or a bigger, "real" image.
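
A minimal sketch of that idea, under some assumptions: the function name filter_real_images, the 100 px default limits, and the $sizeOf parameter are all hypothetical and not from the question. $sizeOf only exists so the size lookup can be swapped out (for caching or testing); by default it is PHP's own getimagesize(), which returns false on failure and otherwise an array with the width at index 0 and the height at index 1.

```php
<?php
/**
 * Keep only images whose width and height both reach the given limits.
 *
 * $minWidth / $minHeight are assumed thresholds; tune them for your pages.
 * $sizeOf is a hypothetical hook for the size lookup, defaulting to getimagesize().
 */
function filter_real_images(array $urls, $minWidth = 100, $minHeight = 100, $sizeOf = 'getimagesize')
{
    $kept = array();
    // The regex in the question returns the same URL many times,
    // so look at each distinct URL only once.
    foreach (array_unique($urls) as $url) {
        // Suppress the warning getimagesize() raises for unreadable files.
        $size = @call_user_func($sizeOf, $url);
        if ($size === false || !isset($size[0], $size[1])) {
            continue; // unreadable, or not an image
        }
        // Both dimensions must reach the limit to count as a "real" image.
        if ($size[0] >= $minWidth && $size[1] >= $minHeight) {
            $kept[] = $url;
        }
    }
    return $kept;
}
```

With the array shown in the question, this would be called as filter_real_images($matches[0]). Note that getimagesize() on a remote URL needs allow_url_fopen enabled and has to fetch data from each image, so it can be slow on pages with many images.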

Community
RazvanZ