
I am doing an assignment for one of my classes.

I am supposed to write a web crawler that downloads files and images from a website, given a specified crawl depth.

I am allowed to use third-party parsing APIs, so I am using Jsoup. I've also tried htmlparser. Both are nice libraries, but neither is perfect.

I used the default Java URLConnection to check the content type before processing a URL, but it becomes really slow as the number of links grows.
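Roughly, the kind of check I mean looks like the sketch below (simplified, not my exact code; the helper name `contentTypeOf` is just for illustration):

    // Rough sketch of a URLConnection-based Content-Type check.
    // Every call opens a new connection, so the cost grows with the number of links.
    static String contentTypeOf(String linkurl) throws java.io.IOException {
        java.net.URLConnection conn = new java.net.URL(linkurl).openConnection();
        return conn.getContentType(); // e.g. "text/html; charset=UTF-8"
    }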

Question: Does anyone know of a specialized parser API for images and links?

I could start writing my own using Jsoup, but I am being lazy. Besides, why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.

I need to check the contentType while looping through the links, to tell whether a link points to a file, in an efficient way, but Jsoup does not have what I need. Here's what I have:

    Connection mimeConn = null;
    Connection.Response mimeResponse = null;
    for (Element link : links) {

        String linkurl = link.absUrl("href");
        if (!linkurl.contains("#")) {

            // Skip links that have already been crawled
            if (DownloadRepository.curlExists(linkurl)) {
                continue;
            }

            // One request per link just to read the Content-Type header
            mimeConn = Jsoup.connect(linkurl)
                            .ignoreContentType(true)
                            .ignoreHttpErrors(true);
            mimeResponse = mimeConn.execute();

            WebUrl webUrl = new WebUrl(linkurl, currentDepth + 1);
            String contentType = mimeResponse.contentType();

            // Sort the link into pages, images, or other files
            if (contentType.contains("html")) {
                page.addToCrawledPages(new WebPage(webUrl));
            } else if (contentType.contains("image")) {
                page.addToImages(new WebImage(webUrl));
            } else {
                page.addToFiles(new WebFile(webUrl));
            }

            DownloadRepository.addCrawledURL(linkurl);
        }
    }

UPDATE: Based on Yoshi's answer, I was able to get my code working correctly. Here's the link:

https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java

Emmanuel John
  • If you're that lazy, check `wget` to do the same. – Viren Feb 15 '13 at 12:30
  • Java development is a lot about research: finding the best API for a given problem domain and then using it to solve your problem. Sure, be lazy and don't reinvent the wheel, but don't be lazy enough to skip doing your own research. – sbk Feb 15 '13 at 12:39

1 Answer


Use Jsoup; I think this API is good enough for your purpose. You can also find a good cookbook on the site.

Several steps:

  1. Jsoup: how to get an image's absolute url?
  2. how to download image from any web page in java
  3. You can write your own recursive method that walks through the links on a page, following links that contain the necessary domain name or relative links. Use this approach to grab all links and find all images on each page. Writing it yourself is not bad practice (see the sketch after this list).
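
A minimal sketch of step 3, assuming a fixed maximum crawl depth and a visited set to avoid loops; the class name SimpleCrawler and its parameters are illustrative, not from the question's code:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    public class SimpleCrawler {

        private final Set<String> visited = new HashSet<>();

        // Recursively walk links up to maxDepth, collecting absolute image URLs.
        public void crawl(String url, int depth, int maxDepth, Set<String> imageUrls) throws Exception {
            if (depth > maxDepth || !visited.add(url)) {
                return; // too deep, or already seen
            }
            Document doc = Jsoup.connect(url).get();

            // absUrl("src") resolves relative image paths against the page URL
            for (Element img : doc.select("img[src]")) {
                imageUrls.add(img.absUrl("src"));
            }

            // Follow only links that stay on the same host
            String host = new URL(url).getHost();
            for (Element link : doc.select("a[href]")) {
                String next = link.absUrl("href");
                if (next.contains(host)) {
                    crawl(next, depth + 1, maxDepth, imageUrls);
                }
            }
        }
    }

Calling crawl(startUrl, 0, crawlDepth, images) then leaves every discovered image URL in the images set, ready to be downloaded as in step 2.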

You don't need to use the URLConnection class; Jsoup has a wrapper for it.

For example, you can get the DOM object with a single line of code:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

Instead of this code:

    URL oracle = new URL("http://www.oracle.com/");
    URLConnection yc = oracle.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(
                                yc.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
    in.close();

Update 1: Try adding the following lines to your code:

Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
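
From there, one possible way to act on the content type, for some URL string linkurl (a sketch that mirrors the question's own categories; it assumes org.jsoup.Connection and org.jsoup.nodes.Document are imported):

    Connection.Response res = Jsoup.connect(linkurl)
                                   .ignoreContentType(true)  // don't reject non-HTML responses
                                   .execute();
    String pageContentType = res.contentType();

    if (pageContentType != null && pageContentType.contains("html")) {
        Document doc = res.parse();            // reuse the already-fetched body, no second request
        // ... extract links and images from doc ...
    } else if (pageContentType != null && pageContentType.contains("image")) {
        byte[] imageBytes = res.bodyAsBytes(); // raw image data, ready to write to a file
        // ... save imageBytes ...
    }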
Ishikawa Yoshi