Extracting relevant image from a web-page

Question

I run a couple of Twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on Twitter.

If I download the page and extract images using the <img> tag, I get a bunch of images, not all of them relevant to the article. For example, images of buttons, icons, ads, etc. are captured. How do I extract the image accompanying the article? I know there is a solution: Facebook's link sharer does this pretty well.

Mithun

Duplicate of: How to find and extract "main" image in website

score 8 · Answer 1 · answered Sep 16 '10 at 11:57

Download all images from the page and blacklist any that come from an ad server. Then find some heuristic that will pick the correct image...

I think something like:

Biggest resolution += 5 pts
Biggest filesize += 10 pts
Jpeg += 2 pts

Then take the image with the most points and throw the rest away.

This probably works for the majority of sites.

(Would require some fiddling with the heuristics though)

Toad

This is the classic approach, and thank you for putting it down here. I was a bit hesitant to go down this path because I was not sure how long it would take. Like you said, it will probably work great after some tuning. A couple more factors that I found elsewhere are: (1) the path of the image, and (2) whether the image's width and height are specified. – mithun Sep 16 '10 at 16:04
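
The scoring idea in this answer, plus the two extra factors from the comment, can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the 5/10/2 weights come from the answer, while the ad-host list, the path hints, and the weights for the two extra factors are assumptions made up for the example. It also assumes the images have already been downloaded and measured.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical blacklist; a real one would be much longer.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com"}
# Assumed path fragments that usually indicate page furniture, not article images.
UI_PATH_HINTS = ("icon", "button", "sprite", "logo")


@dataclass
class Candidate:
    url: str
    width: int                    # pixel width of the downloaded image
    height: int                   # pixel height of the downloaded image
    filesize: int                 # size in bytes
    has_dimensions: bool = False  # True if the <img> tag set width and height


def is_ad(url: str) -> bool:
    """Return True if the image is served from a blacklisted host or subdomain."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def pick_main_image(candidates: List[Candidate]) -> Optional[Candidate]:
    """Score the non-ad images and return the highest-scoring one, if any."""
    pool = [c for c in candidates if not is_ad(c.url)]
    if not pool:
        return None
    largest_area = max(pool, key=lambda c: c.width * c.height)
    largest_file = max(pool, key=lambda c: c.filesize)

    def score(c: Candidate) -> int:
        path = urlparse(c.url).path.lower()
        points = 0
        if c is largest_area:
            points += 5   # biggest resolution (weight from the answer)
        if c is largest_file:
            points += 10  # biggest file size (weight from the answer)
        if path.endswith((".jpg", ".jpeg")):
            points += 2   # JPEG (weight from the answer)
        if c.has_dimensions:
            points += 1   # assumed weight for explicit width/height
        if any(hint in path for hint in UI_PATH_HINTS):
            points -= 3   # assumed penalty for UI-looking paths
        return points

    return max(pool, key=score)
```

Calling `pick_main_image` on the candidates collected from one page returns a single `Candidate` or `None`; the weights would need the fiddling the answer mentions.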

score 5 · Accepted Answer · answered May 14 '16 at 06:59

It's been a long time, but this may help next time.

You can use this API: https://urlmeta.org/

It's very simple to use, and the result is just what we need.

Example of using the API:

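<?php
// Article whose main image we want.
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";

// URL-encode the target so its own characters cannot break the query string.
$result = file_get_contents('https://api.urlmeta.org/?url=' . urlencode($url));
if ($result === false) {
    die("Request to urlmeta failed\n");
}

// Decode the JSON response into an associative array.
$array = json_decode($result, true);
print_r($array['meta']['image']);

?>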

And that's the result you needed.

urlmeta.org is pretty cool. Works for almost all ecommerce product pages. – vaichidrewar Sep 20 '16 at 21:57

score 3 · Answer 3 · answered Sep 16 '10 at 11:52

I kind of came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.

Say the headline of the page I find is "this is a headline".
I use this as a query to the Google Image API and then extract the first thumbnail I find.

It actually works quite well in the majority of cases. Check it out for yourself: http://cricketfresh.in

Mithun

PS: I think this is a good answer. I will give credit to anyone who comes up with a more elegant one.
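
The headline-as-query trick could look roughly like the sketch below. The answer does not say which "Google Image API" it used, so this assumes Google's Custom Search JSON API with `searchType=image` as a stand-in; it needs an API key and a search engine ID, both of which are placeholders supplied by the caller. The function names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: Google's Custom Search JSON API (a stand-in for the
# unspecified "Google Image API" in the answer).
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_image_search_url(headline: str, api_key: str, engine_id: str) -> str:
    """Build an image-search request that uses the headline as the query."""
    params = {
        "key": api_key,         # placeholder: your API key
        "cx": engine_id,        # placeholder: your search engine ID
        "q": headline,
        "searchType": "image",  # restrict results to images
        "num": 1,               # only the first hit is needed
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def first_thumbnail(response: dict) -> Optional[str]:
    """Return the thumbnail URL of the first result that has one, else None."""
    for item in response.get("items", []):
        thumb = item.get("image", {}).get("thumbnailLink")
        if thumb:
            return thumb
    return None


def thumbnail_for_headline(headline: str, api_key: str, engine_id: str) -> Optional[str]:
    """Run the search over the network and return the first thumbnail URL."""
    url = build_image_search_url(headline, api_key, engine_id)
    with urlopen(url, timeout=10) as resp:
        return first_thumbnail(json.load(resp))
```

Note that the thumbnail comes from whatever the search engine ranks first for the headline, so it may not be an image from the article itself.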

score 1 · Answer 4 · answered Sep 16 '10 at 08:16 · edited Sep 16 '10 at 08:21

I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).

Guess I am wrong. Seems that Facebook uses the Open Graph Protocol to define which image (og:image) and which metadata to use.

Serkan

Well, OGP is something Facebook is pushing so that they can extract metadata accurately. Unfortunately, a large number of websites do not follow this standard. – mithun Sep 16 '10 at 11:51
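
For pages that do publish Open Graph metadata, reading `og:image` takes only a few lines. The sketch below uses Python's standard `html.parser` and assumes the page HTML has already been downloaded; the class and function names are made up for the example. It returns `None` when the tag is missing, which, as the comment above notes, will happen on many sites, so some other approach is still needed as a fallback.

```python
from html.parser import HTMLParser
from typing import Optional


class OpenGraphImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.image: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only until the first match is found.
        if tag != "meta" or self.image is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image" and attributes.get("content"):
            self.image = attributes["content"]


def extract_og_image(html: str) -> Optional[str]:
    """Return the og:image URL declared in the page, or None if there is none."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.image
```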
