How to find and extract "main" image in website

Question

I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one which represents the site. (It is sometimes, but not always, the biggest or the first picture on the page.)

How should I approach this? Are there any libraries that could help me with this? Thanks!

[jsoup](http://jsoup.org/).... – MadProgrammer Aug 16 '13 at 07:51

mqchen · Accepted Answer · 2013-08-16T09:07:34.417

OPTION 1

You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image from that article, but it is a bit hit and miss, so 60% of the time it works every time.

It used to be a Java project but has since been rewritten in Scala.

From the readme

Goose will try to extract the following information:

Main text of an article

Main image of article

Any Youtube/Vimeo movies embedded in article

Meta Description

Meta tags

Publish Date

Try it here: http://jimplush.com/blog/goose

OPTION 2

You could use a Java wrapper (e.g. GhostDriver) to run a headless browser such as PhantomJS. Then fetch the website and find the img element with the largest rendered dimensions. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.

OPTION 3

Use a library like jsoup that helps you parse HTML. Then collect the value of the src attribute of every img tag, request each image URL you find, and measure the image's dimensions. The one with the biggest dimensions is likely to be the website's main image.
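The measuring step of Option 3 can be sketched with the JDK alone. This is only a rough sketch, not a tested recipe: it assumes you have already collected absolute image URLs (for instance with jsoup's `absUrl("src")`), and the class and method names (`LargestImagePicker`, `area`, `pickLargest`) are made up for illustration. It downloads each image in full, so it will be slow on image-heavy pages.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.List;

// Hypothetical helper: picks the candidate image with the largest pixel area.
final class LargestImagePicker {

    /** Pixel area of the image at the URL, or 0 if it cannot be fetched or decoded. */
    static long area(URL imageUrl) {
        try (InputStream in = imageUrl.openStream()) {
            // ImageIO.read returns null when no registered reader understands the format.
            BufferedImage image = ImageIO.read(in);
            return image == null ? 0L : (long) image.getWidth() * image.getHeight();
        } catch (IOException e) {
            return 0L; // treat unreachable or broken images as "no size"
        }
    }

    /** Returns the candidate with the largest area, or null if none could be measured. */
    static URL pickLargest(List<URL> candidates) {
        URL best = null;
        long bestArea = 0L;
        for (URL candidate : candidates) {
            long candidateArea = area(candidate);
            if (candidateArea > bestArea) {
                bestArea = candidateArea;
                best = candidate;
            }
        }
        return best;
    }
}
```

Calling `LargestImagePicker.pickLargest(urls)` on the collected list gives you the biggest image by area. You would probably want to skip tiny images (icons, tracking pixels) and very wide banners before trusting the result.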

Thanks for the answer. The problem is, that it says, "Goose is meant to work with individual articles, not homepages", which is kind of the opposite of what I need. — Idan, Aug 16 '13 at 08:07
@nodwj I have updated my answer with two new suggestions for possible approaches. — mqchen, Aug 16 '13 at 09:00

score 5 · Answer 2 · answered Jan 27 '16 at 11:52

Another solution would be to look for the meta tags used for social media sharing first. If they are present, you are lucky; otherwise you can still try the other solutions.

<meta property="og:image" content="http://www.example.com/image.jpg"/>
<meta name="twitter:image" content="http://www.example.com/image.jpg">
<meta itemprop="image" content="http://www.example.com/image.jpg">

If you are using jsoup, the code would look like this:

    // document is an already parsed org.jsoup.nodes.Document.
    // Each lookup yields the tag's content attribute, or null if the tag is absent.
    String imageUrlOpenGraph = document.select("meta[property=og:image]").stream()
            .findFirst()
            .map(meta -> meta.attr("content").trim())
            .orElse(null);

    String imageUrlTwitter = document.select("meta[name=twitter:image]").stream()
            .findFirst()
            .map(meta -> meta.attr("content").trim())
            .orElse(null);

    String imageUrlGooglePlus = document.select("meta[itemprop=image]").stream()
            .findFirst()
            .map(meta -> meta.attr("content").trim())
            .orElse(null);
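Once you have the three strings, you still need to pick one and turn it into an absolute URL, since the content attribute may hold a relative path. A minimal sketch of that step, using only the JDK, might look like the following. The class name `MetaImageChooser`, the method `choose`, and the order of preference (Open Graph, then Twitter, then itemprop) are assumptions for illustration, not something the tags themselves prescribe.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper: takes the meta tag values in order of preference.
final class MetaImageChooser {

    /**
     * Returns the first non-blank candidate, resolved against the page URL,
     * or an empty Optional when no meta tag supplied an image.
     * Throws IllegalArgumentException if a value is not a syntactically valid URI.
     */
    static Optional<URI> choose(String pageUrl, String... candidates) {
        URI base = URI.create(pageUrl);
        return Arrays.stream(candidates)
                .filter(c -> c != null && !c.trim().isEmpty()) // skip missing or empty tags
                .map(String::trim)
                .findFirst()
                .map(c -> base.resolve(c)); // absolute values are returned unchanged
    }
}
```

With the variables from the snippet above, you would call `MetaImageChooser.choose(pageUrl, imageUrlOpenGraph, imageUrlTwitter, imageUrlGooglePlus)`, where `pageUrl` is the address you fetched the document from. If the result is empty, fall back to one of the other approaches.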

score 1 · Answer 3 · answered Jan 30 '14 at 20:57

You could use a service like Embedly. Among a lot of other information, it lets you extract the main image of any page. It works particularly well for articles. You can try it here.

lex82

score 0 · Answer 4 · edited May 23 '17 at 11:46

You need artificial intelligence to do so, namely Computer Vision. It is too big to fit in an answer. This link might help

If you are a mathematician with experience of probability and Bayes' rule, then you can just take the unit called Image Processing and Computer Vision.

If you are looking for existing software to use, check this out...

This Stack Overflow thread might help...

There's also software called Moodstocks which might help.

answered Aug 16 '13 at 07:54

Anshu Dwibhashi

Is there a heuristic to do it more simply? (even with some cost of accuracy?) – Idan Aug 16 '13 at 07:56
no mate, accept the fact. How on earth do you think can you detect images without intelligence? – Anshu Dwibhashi Aug 16 '13 at 07:56
Let me make my question more clear: I need help creating that so called intelligence (AI), and my goal is a rather simple and short algorithm even if not 100% accurate. – Idan Aug 16 '13 at 08:01

score 0 · Answer 5 · answered Sep 26 '16 at 07:18

ImageResolver can do that for you without the need for server-side interaction, except for a small proxy script.

Ma'moon Al-Akash
