22

I wanted to make a Java-based web crawler for an experiment. I heard that making a web crawler in Java was the way to go if this is your first time. However, I have two important questions.

  1. How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software, here I am interested in the Java abstractions)

  2. What libraries should I use? I would assume I need a library for connecting to web pages, a library for HTTP/HTTPS protocol, and a library for HTML parsing.

Kara
  • 6,115
  • 16
  • 50
  • 57
CodeKingPlusPlus
  • 15,383
  • 51
  • 135
  • 216

12 Answers

15

Crawler4j is the best solution for you.

Crawler4j is an open-source Java crawler which provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in 5 minutes!

Also visit this page for more Java-based web crawler tools and a brief explanation of each.
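
To give a rough idea of what that setup looks like, here is a minimal sketch based on crawler4j's typical usage; exact class names and method signatures can vary between versions, and the seed URL, storage folder, and thread count are just example values. You subclass WebCrawler, decide which URLs to follow in shouldVisit, and handle each downloaded page in visit:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        // Decide which discovered links the crawler should follow
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().toLowerCase().startsWith("http://stackoverflow.com/");
        }

        // Called once a page has been fetched and parsed
        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                System.out.println("Text length: " + html.getText().length());
                System.out.println("Outgoing links: " + html.getOutgoingUrls().size());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawler4j");   // intermediate crawl data

            PageFetcher fetcher = new PageFetcher(config);
            RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);

            CrawlController controller = new CrawlController(config, fetcher, robots);
            controller.addSeed("http://stackoverflow.com/");
            controller.start(MyCrawler.class, 4);             // 4 crawler threads
        }
    }

The controller takes care of the URL frontier, the worker threads, and robots.txt handling for you.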

Tinman
  • 786
  • 6
  • 18
cuneytykaya
  • 579
  • 1
  • 5
  • 14
11

This is how your program 'visits' or 'connects' to web pages:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream();  // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));

        // readLine() returns null once the end of the stream is reached
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            if (is != null) {
                is.close();
            }
        } catch (IOException ioe) {
            // nothing to see here
        }
    }

This will download the HTML source of the page.

For HTML parsing, see this.

Also take a look at jSpider and jsoup.

Eugene
  • 10,627
  • 5
  • 49
  • 67
Adil Shaikh
  • 44,509
  • 17
  • 89
  • 111
  • So, does this pull information from a page, or simply go to the page? I'm trying to write a crawler that will take user input, go to maps.google.com, plug in the address and take route time and route length and bring it back to the program. Is this possible? – Ungeheuer May 16 '15 at 15:59
  • @Adrian have a look at google maps api : https://developers.google.com/maps/documentation/distance-matrix/start – Adil Shaikh Jan 31 '17 at 15:37
7

There are now many Java-based HTML parsers that support visiting and parsing HTML pages.

Here's a complete list of HTML parsers with a basic comparison.

Vishnu
  • 1,011
  • 14
  • 31
5

Have a look at these existing projects if you want to learn how it can be done:

A typical crawler process is a loop consisting of fetching, parsing, link extraction, and processing of the output (storing, indexing). The devil is in the details, though: how to be "polite" and respect robots.txt, and how to handle meta tags, redirects, rate limits, URL canonicalization, infinite depth, retries, revisits, etc.

Flow diagram courtesy of Norconex HTTP Collector.
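
To make that loop concrete, here is a single-threaded sketch of the fetch → parse → extract links → process cycle. It uses jsoup (mentioned elsewhere in this thread) purely for convenience; the seed URL and page limit are made-up example values, and it deliberately skips the politeness details listed above:

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class SimpleCrawler {
        public static void main(String[] args) throws Exception {
            Queue<String> frontier = new ArrayDeque<>();
            Set<String> visited = new HashSet<>();
            frontier.add("http://stackoverflow.com/");   // seed URL (example only)

            int maxPages = 50;                           // arbitrary stop condition
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String url = frontier.poll();
                if (!visited.add(url)) {
                    continue;                            // already fetched this URL
                }

                // Fetch and parse (a real crawler would also rate-limit,
                // honor robots.txt, and handle redirects/retries)
                Document doc = Jsoup.connect(url).get();

                // Process the output (here we just print the title)
                System.out.println(url + " -> " + doc.title());

                // Extract links and add them to the frontier
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (next.startsWith("http")) {
                        frontier.add(next);
                    }
                }
            }
        }
    }

The existing projects mentioned above essentially wrap this same loop in a multi-threaded, fault-tolerant, configurable form.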

rustyx
  • 80,671
  • 25
  • 200
  • 267
4

For parsing content, I'm using Apache Tika.
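
In case an example helps, here is a minimal sketch of extracting plain text from a fetched page with Tika's high-level facade class (the URL is only an example):

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.tika.Tika;

    public class TikaExample {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            try (InputStream in = new URL("http://stackoverflow.com/").openStream()) {
                // Tika detects the content type and extracts the plain text
                String text = tika.parseToString(in);
                System.out.println(text);
            }
        }
    }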

Dave Clemmer
  • 3,741
  • 12
  • 49
  • 72
Waji
  • 71
  • 3
3

I'd like to propose another solution that nobody has mentioned yet. There is a library called Selenium; it is an open-source tool used for automating web applications for testing purposes, but it is certainly not limited to that. You can write a web crawler with it and benefit from this automation tool, which interacts with pages just as a human would.

As an illustration, I will provide a quick tutorial to give a better idea of how it works. If you don't feel like reading this post, take a look at this video to understand what this library can offer for crawling web pages.

Selenium Components

To begin with, Selenium consists of various components that work together within a single process and perform their actions on behalf of the Java program. The main component is called WebDriver, and it must be included in your program to make it work properly.

Go to the following site here and download the latest release for your operating system (Windows, Linux, or macOS). It is a ZIP archive containing chromedriver.exe. Extract it to a convenient location such as C:\WebDrivers\User\chromedriver.exe. We will use this location later in the Java program.

The next step is to include the jar library. Assuming you are using a Maven project to build the Java program, you need to add the following dependency to your pom.xml:

<dependency>
 <groupId>org.seleniumhq.selenium</groupId>
 <artifactId>selenium-java</artifactId>
 <version>3.8.1</version>
</dependency>

Selenium WebDriver Setup

Let us get started with Selenium. The first step is to create a ChromeDriver instance:

System.setProperty("webdriver.chrome.driver", "C:\\WebDrivers\\User\\chromedriver.exe");
WebDriver driver = new ChromeDriver();

Now it's time to dig deeper into the code. The following example shows a simple program that opens a web page and extracts some useful HTML components. It is easy to follow, as the comments explain the steps clearly. Please take a brief look to understand how the objects are captured.

// Launch website
driver.navigate().to("http://www.calculator.net/");

// Maximize the browser
driver.manage().window().maximize();

// Click on Math Calculators
driver.findElement(By.xpath(".//*[@id = 'menu']/div[3]/a")).click();

// Click on Percent Calculators
driver.findElement(By.xpath(".//*[@id = 'menu']/div[4]/div[3]/a")).click();

// Enter value 10 in the first number of the percent Calculator
driver.findElement(By.id("cpar1")).sendKeys("10");

// Enter value 50 in the second number of the percent Calculator
driver.findElement(By.id("cpar2")).sendKeys("50");

// Click Calculate Button
driver.findElement(By.xpath(".//*[@id = 'content']/table/tbody/tr[2]/td/input[2]")).click();

// Get the result text based on its xpath
String result =
    driver.findElement(By.xpath(".//*[@id = 'content']/p[2]/font/b")).getText();

// Print the result to the console
System.out.println(" The Result is " + result);

Once you are done with your work, the browser window can be closed with:

driver.quit();

Selenium Browser Options

There is a lot of functionality you can take advantage of when working with this library. For example, assuming you are using Chrome, you can add to your code:

ChromeOptions options = new ChromeOptions();

Take a look at how we can use WebDriver to load Chrome extensions using ChromeOptions:

options.addExtensions(new File("src\\test\\resources\\extensions\\extension.crx"));

This is for using incognito mode:

options.addArguments("--incognito");

These are for disabling JavaScript and info bars:

options.addArguments("--disable-infobars");
options.addArguments("--disable-javascript");

And this one if you want the browser to scrape silently, hiding the crawling in the background:

options.addArguments("--headless");

Once you are done configuring the options, pass them to the driver:

WebDriver driver = new ChromeDriver(options);

To sum up, let's see what Selenium has to offer that makes it a unique choice compared with the other solutions proposed in this post so far:

  • Language and Framework Support
  • Open Source Availability
  • Multi-Browser Support
  • Support Across Various Operating Systems
  • Ease Of Implementation
  • Reusability and Integrations
  • Parallel Test Execution and Faster Go-to-Market
  • Easy to Learn and Use
  • Constant Updates
Panagiotis Drakatos
  • 2,851
  • 4
  • 31
  • 43
2

I recommend using the HttpClient library. You can find examples here.
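
For reference, here is a minimal sketch of fetching a page's HTML with Apache HttpClient (assuming the 4.x API; the URL is just an example):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpClientExample {
        public static void main(String[] args) throws Exception {
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpGet get = new HttpGet("http://stackoverflow.com/");
                try (CloseableHttpResponse response = client.execute(get)) {
                    // Read the response body (the page's HTML) into a String
                    String html = EntityUtils.toString(response.getEntity());
                    System.out.println(html);
                }
            }
        }
    }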

Benoit
  • 1,995
  • 1
  • 13
  • 18
2

I would prefer crawler4j. Crawler4j is an open-source Java crawler which provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in a few hours.

josliber
  • 43,891
  • 12
  • 98
  • 133
Vivek Vermani
  • 1,934
  • 18
  • 45
2

I think jsoup is better than the others; jsoup runs on Java 1.5 and up, Scala, Android, OSGi, and Google App Engine.
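
If a quick example helps, here is a minimal sketch of fetching a page and pulling out elements with jsoup's CSS selectors (the URL and selector are just examples):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupExample {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the page in one call
            Document doc = Jsoup.connect("http://stackoverflow.com/").get();
            System.out.println("Title: " + doc.title());

            // Select elements with a CSS selector and read their attributes/text
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.text() + " -> " + link.absUrl("href"));
            }
        }
    }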

Saeed Zarinfam
  • 9,818
  • 7
  • 59
  • 72
0

You can explore Apache Droids or Apache Nutch to get a feel for Java-based crawlers.

Sagar
  • 1,315
  • 8
  • 15
0

Though mainly used for unit testing web applications, HttpUnit can traverse a website, click links, analyze tables and form elements, and give you metadata about all the pages. I use it for web crawling, not just for unit testing. - http://httpunit.sourceforge.net/
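
A rough sketch of what that looks like (method names may vary slightly by HttpUnit version; the URL is just an example):

    import com.meterware.httpunit.WebConversation;
    import com.meterware.httpunit.WebLink;
    import com.meterware.httpunit.WebResponse;

    public class HttpUnitExample {
        public static void main(String[] args) throws Exception {
            WebConversation wc = new WebConversation();

            // Fetch a page, read its title, and list the links it contains
            WebResponse response = wc.getResponse("http://httpunit.sourceforge.net/");
            System.out.println("Title: " + response.getTitle());

            for (WebLink link : response.getLinks()) {
                System.out.println(link.getURLString());
            }
        }
    }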

fandang
  • 605
  • 1
  • 6
  • 14
0

Here is a list of available crawlers:

https://java-source.net/open-source/crawlers

But I suggest using Apache Nutch.

sendon1982
  • 9,982
  • 61
  • 44