Currently I need a program that given a URL, returns a list of all the images on the webpage.
ie:
logo.png gallery1.jpg test.gif
Is there any open source software available before I try and code something?
Language should be java. Thanks Philip
Currently I need a program that given a URL, returns a list of all the images on the webpage.
ie:
logo.png gallery1.jpg test.gif
Is there any open source software available before I try and code something?
Language should be java. Thanks Philip
Just use a simple HTML parser, like jTidy, and then get all elements by tag name img
and then collect the src
attribute of each in a List<String>
or maybe List<URI>
.
You can obtain an InputStream
of an URL
using URL#openStream()
and then feed it to any HTML parser you like to use. Here's a kickoff example:
InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();
for (int i = 0; i < imgs.getLength(); i++) {
srcs.add(imgs.item(i).getAttributes().getNamedItem("src").getNodeValue());
}
for (String src: srcs) {
System.out.println(src);
}
I must however admit that HtmlUnit as suggested by Bozho indeed looks better.
HtmlUnit has HtmlPage.getElementsByTagName("img")
, which will probably suit you.
(read the short Get started guide to see how to obtain the correct HtmlPage
object)
This is dead simple with HTML Parser (and any other decent HTML parser):
Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));
for ( SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
Tag tag = (Tag) iterator.nextNode();
System.out.println(tag.getAttribute("src"));
}
With Open Graph tags and HTML unit, you can extract your data really easily (PageMeta is a simple POJO holding the results):
Parser parser = new Parser(url);
PageMeta pageMeta = new PageMeta();
pageMeta.setUrl(url);
NodeList meta = parser.parse(new TagNameFilter("meta"));
for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
Tag tag = (Tag) iterator.nextNode();
if ("og:image".equals(tag.getAttribute("property"))) {
pageMeta.setImageUrl(tag.getAttribute("content"));
}
if ("og:title".equals(tag.getAttribute("property"))) {
pageMeta.setTitle(tag.getAttribute("content"));
}
if ("og:description".equals(tag.getAttribute("property"))) {
pageMeta.setDescription(tag.getAttribute("content"));
}
}
You can simply use regular expression in Java
<html>
<body>
<p>
<img src="38220.png" alt="test" title="test" />
<img src="32222.png" alt="test" title="test" />
</p>
</body>
</html>
String s ="html"; //above html content
Pattern p = Pattern.compile("<img [^>]*src=[\\\"']([^\\\"^']*)");
Matcher m = p.matcher (s);
while (m.find()) {
String src = m.group();
int startIndex = src.indexOf("src=") + 5;
String srcTag = src.substring(startIndex, src.length());
System.out.println( srcTag );
}