
I have a couple of Twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on Twitter.

If I download the page and extract images using the <img> tag, I get a bunch of images, and not all of them are relevant to the article. For example, images of buttons, icons, ads, etc. are captured. How do I extract the image accompanying the article? I know there is a solution: Facebook's link sharer does this pretty well.

Mithun

Duplicate of: How to find and extract "main" image in website

Community
mithun

4 Answers


Download all images from the page and blacklist any that come from an ad server. Then find some heuristic that will get you the correct image.

I think something like:

  • Biggest resolution += 5 pts
  • Biggest filesize += 10 pts
  • JPEG += 2 pts

Then take the image with the most points and throw the rest away.

This probably works for the majority of sites.

(Would require some fiddling with the heuristics though)
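The scoring idea above could be sketched roughly as follows in Python. This is only an illustration: the `Candidate` type, the `AD_HOSTS` blacklist, and the function names are hypothetical, and it assumes the dimensions and file size of each image have already been collected by whatever downloads the page. The point values are the ones suggested above and would need tuning.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical blacklist; a real list of ad-server hosts would need curating.
AD_HOSTS = {"ads.example.com", "doubleclick.net"}


@dataclass
class Candidate:
    url: str
    width: int
    height: int
    size_bytes: int


def is_ad(url):
    """True if the image host is a blacklisted host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def pick_main_image(candidates):
    """Return the highest-scoring non-ad candidate, or None if none remain."""
    pool = [c for c in candidates if not is_ad(c.url)]
    if not pool:
        return None

    biggest_res = max(pool, key=lambda c: c.width * c.height)
    biggest_file = max(pool, key=lambda c: c.size_bytes)

    def score(c):
        pts = 0
        if c is biggest_res:
            pts += 5   # biggest resolution
        if c is biggest_file:
            pts += 10  # biggest file size
        if urlparse(c.url).path.lower().endswith((".jpg", ".jpeg")):
            pts += 2   # JPEG, judged by file extension only
        return pts

    return max(pool, key=score)
```

Judging JPEG by extension is a shortcut; checking the Content-Type header or the file's magic bytes would be more reliable.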

Toad
  • This is the classic approach, and thank you for writing it down here. I was a bit hesitant to go down this path because I was not sure how long it would take. Like you said, it will probably work great after some tuning. A couple more factors that I found elsewhere are: 1) the path of the image; 2) whether the image's width and height are specified. – mithun Sep 16 '10 at 16:04

This is an old question, but this may help future readers.

You can use this API: https://urlmeta.org/

It is very simple to use, and the result is just what we need.

Example of using the API:

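<?php
// Article whose main image we want.
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";

// URL-encode the target so its own special characters do not break the request.
$result = file_get_contents('https://api.urlmeta.org/?url=' . urlencode($url));
$array = json_decode($result, true); // decode into an associative array
print_r($array['meta']['image']);

?>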

And that is the result you need.


I came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails:

  1. Say the headline of the page I find is "this is a headline"
  2. I use this as a query to the Google Image API and then extract the first thumbnail I find.

It actually works quite well in the majority of cases. Check it out for yourself: http://cricketfresh.in

Mithun

PS: I think this is a good answer, but I will give credit to anyone who comes up with a more elegant one.

mithun

I would guess that Facebook has a link extractor for each of the sites it supports, something like id="content" -> img (1st).

I guess I was wrong. It seems that Facebook uses the Open Graph Protocol to determine which image (og:image) and which metadata to use.
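For pages that do publish Open Graph tags, reading og:image is straightforward. Here is a minimal Python sketch using only the standard library's html.parser; the `OgImageParser` class and `find_og_image` function are hypothetical names, and the code assumes the page HTML has already been downloaded into a string. It returns None when the tag is absent, in which case some other heuristic would have to take over.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Records the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first og:image is kept.
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.og_image = attr["content"]


def find_og_image(html):
    """Return the og:image URL declared in the HTML, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image
```

Usage would look like `find_og_image(page_html)`, where `page_html` is the downloaded page source.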

Serkan
  • Well, OGP is something Facebook is pushing so that they can extract metadata accurately. Unfortunately, a large number of websites do not follow this standard. – mithun Sep 16 '10 at 11:51