
I'm trying to use the Basic crawler example in crawler4j. I took the code from the crawler4j website here.

package edu.crawler;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
            int docid = page.getWebURL().getDocid();
            String url = page.getWebURL().getURL();
            String domain = page.getWebURL().getDomain();
            String path = page.getWebURL().getPath();
            String subDomain = page.getWebURL().getSubDomain();
            String parentUrl = page.getWebURL().getParentUrl();
            String anchor = page.getWebURL().getAnchor();

            System.out.println("Docid: " + docid);
            System.out.println("URL: " + url);
            System.out.println("Domain: '" + domain + "'");
            System.out.println("Sub-domain: '" + subDomain + "'");
            System.out.println("Path: '" + path + "'");
            System.out.println("Parent page: " + parentUrl);
            System.out.println("Anchor text: " + anchor);

            if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                    String text = htmlParseData.getText();
                    String html = htmlParseData.getHtml();
                    List<WebURL> links = htmlParseData.getOutgoingUrls();

                    System.out.println("Text length: " + text.length());
                    System.out.println("Html length: " + html.length());
                    System.out.println("Number of outgoing links: " + links.size());
            }

            Header[] responseHeaders = page.getFetchResponseHeaders();
            if (responseHeaders != null) {
                    System.out.println("Response headers:");
                    for (Header header : responseHeaders) {
                            System.out.println("\t" + header.getName() + ": " + header.getValue());
                    }
            }

            System.out.println("=============");
    }
}

Above is the code for the crawler class from the example.
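As a side note (not part of the original example), the shouldVisit logic above can be sanity-checked in isolation. The following standalone sketch just reproduces the same FILTERS pattern and host-prefix test outside the crawler:

import java.util.regex.Pattern;

// Standalone check of the same filtering rule used in MyCrawler.shouldVisit
public class ShouldVisitDemo {

    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://www.ics.uci.edu/~lopes/"));    // true  - HTML page on the seed host
        System.out.println(shouldVisit("http://www.ics.uci.edu/logo.png"));   // false - image extension is filtered
        System.out.println(shouldVisit("http://www.example.com/index.html")); // false - different host
    }
}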

package edu.crawler;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
            String crawlStorageFolder = "../data/";
            int numberOfCrawlers = 7;

            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);

            /*
             * Instantiate the controller for this crawl.
             */
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            /*
             * For each crawl, you need to add some seed urls. These are the first
             * URLs that are fetched and then the crawler starts following links
             * which are found in these pages
             */
            controller.addSeed("http://www.ics.uci.edu/~welling/");
            controller.addSeed("http://www.ics.uci.edu/~lopes/");
            controller.addSeed("http://www.ics.uci.edu/");

            /*
             * Start the crawl. This is a blocking operation, meaning that your code
             * will reach the line after this only when crawling is finished.
             */
            controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

Above is the controller class for the web crawler. When I try to run the Controller class from my IDE (IntelliJ), I get the following error:

Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/uci/ics/crawler4j/crawler/CrawlConfig : Unsupported major.minor version 51.0

Is there something about the Maven config found here that I should know? Do I have to use a different version or something?
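(Not part of the original question, but a quick way to see which Java runtime the IDE's run configuration is actually using is a throwaway class like the one below; the class name is just illustrative. Class-file version 51.0 corresponds to Java 7, 50.0 to Java 6.)

// Illustrative check: prints the Java version the current run configuration
// uses and the highest class-file version it can load. Classes compiled as
// version 51.0 need a Java 7 runtime.
public class RuntimeVersionCheck {
    public static void main(String[] args) {
        System.out.println("java.version       = " + System.getProperty("java.version"));
        System.out.println("java.class.version = " + System.getProperty("java.class.version"));
    }
}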

j.jerrod.taylor
  • From the sounds of it, you're trying to execute code that was compiled on a later version of Java than the one you are running. For example, the code was compiled with Java 7 and you're running Java 6, or it was compiled with Java 6 and you are running Java 5... – MadProgrammer Mar 14 '13 at 00:29
  • Check out http://stackoverflow.com/questions/10382929/unsupported-major-minor-version-51-0 – Farlan Mar 14 '13 at 00:34
  • @j.jerrod.taylor I am facing an issue in a very basic program. I am getting an exception: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/methods/HttpUriRequest at com.crawler.web.BasicCrawlController.main(BasicCrawlController.java:78) Caused by: java.lang.ClassNotFoundException: org.apache.http.client.methods.HttpUriRequest. Please suggest if any other jar is also required. – Amritpal Singh Jun 14 '13 at 16:47
  • @AmritpalSingh For me the problem was that I was using a version of the code that was compiled using a different version of Java than the one that I had installed on my computer. If you have the same problem then you should either update your version of Java or use an older version of the code. – j.jerrod.taylor Jun 15 '13 at 19:32

1 Answer


The problem wasn't with crawler4j. The problem was that the version of Java I was running was older than the version of Java used to build crawler4j. I switched to the crawler4j release from right before they updated to Java 7 and everything worked fine. I'm guessing that upgrading my version of Java to 7 would have had the same effect.
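For anyone hitting the same error, here is a minimal diagnostic sketch (not part of the original answer, and Java 6 compatible) that reads the major/minor version directly out of a compiled .class file and prints it next to what the running JVM supports; the file path is just a placeholder:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Reads the class-file header (magic, minor, major) of a compiled class and
// prints it next to the version the running JVM supports. 50 = Java 6, 51 = Java 7.
public class ClassVersionCheck {

    public static void main(String[] args) throws IOException {
        System.out.println("JVM supports class version: " + System.getProperty("java.class.version"));

        String classFile = "CrawlConfig.class"; // placeholder: a class extracted from the crawler4j jar
        DataInputStream in = new DataInputStream(new FileInputStream(classFile));
        try {
            in.readInt();                       // magic number 0xCAFEBABE
            int minor = in.readUnsignedShort(); // minor version
            int major = in.readUnsignedShort(); // major version
            System.out.println("Class file version: " + major + "." + minor);
        } finally {
            in.close();
        }
    }
}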

j.jerrod.taylor
  • Can I crawl a dynamic website using crawler4j (Java)? http://stackoverflow.com/questions/27264931/crawling-dynamic-website-using-java?noredirect=1#comment43002565_27264931 – BasK Dec 03 '14 at 11:42