Get all Images from WebPage Program | Java

Question

I need a program that, given a URL, returns a list of all the images on the webpage.

e.g.:

logo.png gallery1.jpg test.gif

Is there any open source software available before I try and code something?

The language should be Java. Thanks, Philip

Sorry, how do you mean? I just need a .jar file or something that I can link into an existing Java program I'm writing. I'd imagine the program would be fairly simple; I just need the image extraction operation. – Phil, Jan 31 '10 at 18:20
I don't think you will find any such library that exactly suits your scenario. You will have to use a parser and write some downloading code yourself. — craftsman, Jan 31 '10 at 18:26

BalusC · Answer 1 · 2010-01-31T18:45:54.390

14

Just use a simple HTML parser, like jTidy, and then get all elements by tag name img and then collect the src attribute of each in a List<String> or maybe List<URI>.

You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:

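// Requires jTidy (org.w3c.tidy.Tidy); Document, Node and NodeList come from org.w3c.dom.
InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();

for (int i = 0; i < imgs.getLength(); i++) {
    // Skip img elements without a src attribute instead of failing with a NullPointerException.
    Node src = imgs.item(i).getAttributes().getNamedItem("src");
    if (src != null) {
        srcs.add(src.getNodeValue());
    }
}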

for (String src: srcs) {
    System.out.println(src);
}
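
The src values collected this way are often relative to the page. If a List<URI> is preferred, one way to get absolute addresses is to resolve each value against the page URL. The following is only a sketch using the JDK's java.net.URI; the class name ResolveImageSources, the method name toAbsoluteUris and the example.com addresses are made up for illustration and are not part of jTidy or of the answer above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResolveImageSources {

    /** Turns raw src values into absolute URIs, resolved against the page address. */
    public static List<URI> toAbsoluteUris(URI pageUri, List<String> srcs) {
        List<URI> result = new ArrayList<>();
        for (String src : srcs) {
            try {
                // resolve() leaves absolute URIs untouched and completes relative ones.
                result.add(pageUri.resolve(src.trim()));
            } catch (IllegalArgumentException e) {
                // The value is not a syntactically valid URI (for example it
                // contains an unescaped space); this sketch simply skips it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical page address and src values, for illustration only.
        URI page = URI.create("http://example.com/gallery/index.html");
        List<String> srcs = Arrays.asList(
                "logo.png", "/img/gallery1.jpg", "http://cdn.example.com/test.gif");
        for (URI uri : toAbsoluteUris(page, srcs)) {
            System.out.println(uri);
        }
    }
}
```

Skipping invalid values silently is a design choice of this sketch; a real program might prefer to escape or log them.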

I must however admit that HtmlUnit as suggested by Bozho indeed looks better.

edited Jan 31 '10 at 18:45

answered Jan 31 '10 at 18:21

BalusC

And HtmlUnit does roughly what your answer describes, so +1 for clarifying what exactly should happen. – Bozho Jan 31 '10 at 18:48
HtmlUnit is however less bloated than jTidy. It offers built-in ways to open a webpage and obtain elements/attributes of interest using XPath. – BalusC Jan 31 '10 at 19:28

score 12 · Accepted Answer · answered Jan 31 '10 at 18:23

HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.

(read the short Get started guide to see how to obtain the correct HtmlPage object)

Bozho

score 4 · Answer 3 · answered Jan 31 '10 at 18:52

This is dead simple with HTML Parser (and any other decent HTML parser):

Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));

for ( SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}

score 0 · Answer 4 · answered Jan 31 '10 at 18:21

You can use wget, which has a lot of options available.

Or google for java wget ...

PeterMmm

score 0 · Answer 5 · answered Jan 31 '10 at 18:24

You can parse the HTML and collect the SRC attributes of all IMG elements in a Collection. Then download the resource at each URL and write it to a file. Several HTML parsers are available for the parsing step; Cobra is one of them.
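
The steps above could be sketched with JDK classes only. This is a rough sketch under stated assumptions, not a definitive implementation: instead of Cobra it swaps in the lenient HTML parser bundled with Swing (javax.swing.text.html.parser.ParserDelegator), it assumes the page is UTF-8, and the class name ImageDownloadSketch, its method names and the "images" output directory are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ImageDownloadSketch {

    /** Step 1: parse the HTML and collect the raw src value of every img tag. */
    public static List<String> collectImageSources(Reader html) throws IOException {
        final List<String> srcs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img has no closing tag, so the parser reports it as a "simple" tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        srcs.add(src.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(html, callback, true);
        return srcs;
    }

    /** Step 2: resolve src against the page address and write the resource to targetDir. */
    public static Path download(URI pageUri, String src, Path targetDir) throws IOException {
        URI imageUri = pageUri.resolve(src);
        if (imageUri.isOpaque()) {
            // e.g. data: URIs, which have no path to derive a file name from
            throw new IOException("Unsupported image URI: " + src);
        }
        String path = imageUri.getPath();
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        if (fileName.isEmpty()) {
            fileName = "image"; // assumed fallback name
        }
        Path target = targetDir.resolve(fileName);
        try (InputStream in = imageUri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ImageDownloadSketch <page-url>");
            return;
        }
        URI pageUri = URI.create(args[0]);
        Path targetDir = Files.createDirectories(Paths.get("images"));
        try (Reader reader = new InputStreamReader(
                pageUri.toURL().openStream(), StandardCharsets.UTF_8)) {
            for (String src : collectImageSources(reader)) {
                try {
                    System.out.println(download(pageUri, src, targetDir));
                } catch (IOException | IllegalArgumentException e) {
                    System.err.println("Skipping " + src + ": " + e.getMessage());
                }
            }
        }
    }
}
```

Images that share a file name overwrite each other in this sketch; a real program would need a naming scheme that avoids such collisions.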

craftsman

score 0 · Answer 6 · answered May 09 '16 at 03:52

With Open Graph tags and HTML Parser (the org.htmlparser library), you can extract your data really easily (PageMeta is a simple POJO holding the results):

    Parser parser = new Parser(url);

    PageMeta pageMeta = new PageMeta();
    pageMeta.setUrl(url);

    NodeList meta = parser.parse(new TagNameFilter("meta"));
    for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
        Tag tag = (Tag) iterator.nextNode();

        if ("og:image".equals(tag.getAttribute("property"))) {
            pageMeta.setImageUrl(tag.getAttribute("content"));
        }

        if ("og:title".equals(tag.getAttribute("property"))) {
            pageMeta.setTitle(tag.getAttribute("content"));
        }

        if ("og:description".equals(tag.getAttribute("property"))) {
            pageMeta.setDescription(tag.getAttribute("content"));
        }
    }

score 0 · Answer 7 · edited Feb 08 '18 at 08:38

You can simply use a regular expression in Java, at least for simple markup like this:

<html>
<body>
<p>
<img src="38220.png" alt="test" title="test" /> 
<img src="32222.png" alt="test" title="test" />
</p>
</body>
</html>

    // Needs java.util.regex.Pattern and java.util.regex.Matcher.
    String s = "html";  // replace with the HTML content shown above
    // Capture group 1 holds the src value between the quotes.
    Pattern p = Pattern.compile("<img\\s[^>]*?src=[\"']([^\"']*)", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(s);
    while (m.find()) {
        // Read the captured group directly instead of slicing the full match by index.
        String src = m.group(1);
        System.out.println(src);
    }
