It's perfectly easy to download all images from a website using wget.

But I need this feature on the client side, ideally in Java.

I know wget's source code is available online, but I don't know any C and the source is quite complex. Of course, wget also has many other features that bloat the source for my purposes.

Java has a built-in HttpClient, but I don't know how sophisticated wget really is. Could you tell me whether it is hard to re-implement the "download all images recursively" feature in Java?

How is this done, exactly? Does wget fetch the HTML source of the given URL, extract all URLs with the given file endings (.jpg, .png) from the HTML and then download them? Does it also search for images in the stylesheets that are linked from that HTML document?

How would you do this? Would you use regular expressions to search for (both relative and absolute) image URLs within the HTML document and let HttpClient download each of them? Or is there already some Java library that does something similar?
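
To make the approach in the question concrete: a rough, illustrative sketch of "fetch the HTML, grab `src` attributes by regex, download each hit" could look like the following. It assumes the modern `java.net.http.HttpClient` (Java 11+; on older Java or Android you would use `HttpURLConnection` or Apache's client instead), and the regex, the example URL and the missing error handling are all placeholders; it is not how wget itself works, and a real HTML parser is more robust than a regex.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: fetch one page, pull out src="..." values with image
// extensions, resolve them against the page URL and download each one.
public class ImageGrabber {

    private static final Pattern SRC = Pattern.compile(
            "src\\s*=\\s*\"([^\"]+\\.(?:png|jpe?g|gif))\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        URI page = URI.create("http://example.com/");    // placeholder URL
        HttpClient client = HttpClient.newHttpClient();

        String html = client.send(HttpRequest.newBuilder(page).build(),
                HttpResponse.BodyHandlers.ofString()).body();

        Matcher m = SRC.matcher(html);
        while (m.find()) {
            URI img = page.resolve(m.group(1));           // handles relative URLs
            String name = img.getPath().substring(img.getPath().lastIndexOf('/') + 1);
            Path target = Paths.get(name);
            client.send(HttpRequest.newBuilder(img).build(),
                    HttpResponse.BodyHandlers.ofFile(target));
            System.out.println("Saved " + target);
        }
    }
}

This only covers `img` tags with the listed extensions; background images from stylesheets would need a second pass over the linked CSS files.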

caw
  • You might want to take a look at [Jerry](http://jodd.org/doc/jerry/). It provides jQuery-like selectors for HTML documents and it might help you find all of the images to download. – Christian Trimble Sep 16 '13 at 15:28
  • If you are familiar with wget, why don't you use wget from Java? I mean, write a simple Java class that invokes a script which in turn contains your wget call! – Krishna Sep 16 '13 at 23:00
  • @Krishna: I'm implementing this task for two programs, one that runs on Android and one on Windows, where I don't have access to wget, unfortunately. This is why I need a pure Java solution, without calling any external programs. – caw Sep 17 '13 at 03:04
  • @C.Trimble: Thanks, Jerry definitely looks cool and is good to have in the toolbox :) – caw Sep 17 '13 at 03:05

3 Answers


In Java you could use the Jsoup library to parse any web page and extract anything you want.
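
If you go the Jsoup route, a minimal sketch (the URL is a placeholder) might look like this; the `abs:` prefix makes Jsoup resolve relative paths against the page's base URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: print the absolute URLs of all <img> elements on one page.
public class JsoupImages {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/").get();   // placeholder URL
        for (Element img : doc.select("img[src]")) {
            System.out.println(img.attr("abs:src"));   // absolute image URL
        }
    }
}

Downloading each of these URLs is then a plain HTTP GET; as the comment below points out, this still misses background images that are only referenced from CSS.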

DraganescuValentin
  • Thank you, Jsoup looks quite good, although it only lets you find the `img` tags, which is merely a small part of scraping all images. – caw Sep 13 '13 at 20:42

For me, crawler4j was the open-source library of choice to recursively crawl (and replicate) a site, e.g. as in their QuickStart example below (it also supports crawling URLs found in CSS):

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new URL and the second parameter is
     * the new URL. You should implement this method to specify whether
     * the given URL should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore URLs that
     * have css, js, gif, ... extensions and to only accept URLs that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This method is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
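
To actually run this crawler you also need the controller part of their QuickStart, roughly like the sketch below (the storage folder is a placeholder and constructor signatures may vary slightly between crawler4j versions):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");    // placeholder storage folder

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.ics.uci.edu/");     // same seed as the filter above

        int numberOfCrawlers = 7;
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

Note that to have the crawler fetch the images themselves you would also have to stop filtering out the image extensions in `shouldVisit` and, as far as I remember, enable binary content via `config.setIncludeBinaryContentInCrawling(true)`.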

More webcrawlers and HTML parsers can be found here.

Andreas Covidiot
  • One problem that I had with `wget` was that I had to use its `--reject-regex` and `--accept-regex` options, and my regexps are quite complex; testing the regexps on the console is a pain. They seem to behave differently from `sed` (or `sed -e`) regexp parsing, I hate having to escape all the `()+` etc. signs, which makes them even less readable, and the style differs from Java's `Pattern` syntax. – Andreas Covidiot Apr 26 '17 at 13:39

I found this program, which downloads images. It is open source.

You could get the images on a website via the `<img>` tags. Have a look at the following question; it might help you: Get all Images from WebPage Program | Java
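
Regarding the stylesheet images asked about in the question (and missed by the `<img>`-based approaches, see the comments below): a crude way to pick those up is to fetch each linked stylesheet and scan it for `url(...)` references. A sketch, with a placeholder URL and a deliberately naive regex (it ignores `@import` rules and inline `style` attributes):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract image URLs referenced via url(...) inside one stylesheet.
public class CssImageUrls {

    private static final Pattern CSS_URL = Pattern.compile("url\\(\\s*['\"]?([^'\")]+)['\"]?\\s*\\)");

    public static List<URI> imageUrls(URI stylesheet) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String css = client.send(HttpRequest.newBuilder(stylesheet).build(),
                HttpResponse.BodyHandlers.ofString()).body();

        List<URI> result = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            result.add(stylesheet.resolve(m.group(1)));   // CSS URLs are relative to the stylesheet
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder stylesheet URL
        for (URI u : imageUrls(URI.create("http://example.com/style.css"))) {
            System.out.println(u);
        }
    }
}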

Nufail
  • Thank you! While "Java Image Downloader" (first link) does not seem to be the solution, "HtmlUnit" seems quite interesting. However, it won't retrieve background images and images linked in stylesheets, whereas wget does so. – caw Sep 10 '13 at 02:45
  • Unfortunately, `HtmlUnit` has lots of dependencies and thus a size of about 12MB (without sources). I'm sure there are more lightweight libraries. – caw Sep 13 '13 at 20:41