
I'm not able to find any good Java-based web scraping API. The site I need to scrape does not provide an API either; I want to iterate over all of its web pages using some pageID and extract the HTML titles and other data from their DOM trees.

Are there ways other than web scraping?

NoneType

10 Answers


jsoup

Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is Jsoup.

You can navigate the page using the DOM if you know the page structure; see http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library and I've used it in my recent projects.
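
For the pageID iteration described in the question, a minimal sketch with Jsoup might look like this (the URL scheme and page range are hypothetical stand-ins for the real site):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTitleScraper {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL scheme; substitute the real site's pageID pattern.
        for (int pageId = 1; pageId <= 100; pageId++) {
            Document doc = Jsoup.connect("https://example.com/page?id=" + pageId)
                    .userAgent("Mozilla/5.0") // some sites block the default agent
                    .get();
            System.out.println(pageId + ": " + doc.title());
            // Other DOM data via CSS selectors, e.g. doc.select("h1").text()
        }
    }
}
```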

Wajdy Essam
  • Thanks, it's a nice library with no dependencies, so it's quite lightweight. It's also headless, so it doesn't need a browser (I've had problems with **Selenium** opening Chrome, and I couldn't use **HtmlUnit** at all). **Selenium** must be more realistic, but this library should serve the purpose in most scraping cases, and it's really easy to set up: add the dependency and you're good to go. – Ferran Maylinch May 31 '14 at 17:13
  • Excellent library indeed. Easy setup and powerful selector support: doc.select("li[id^=cosid_]"). Cool. – EMM Jul 19 '16 at 15:21
  • I have recently open-sourced my web scraping framework that not only allows you to parse documents with Jsoup and HtmlUnit, but also handles the parallelization for you and can manage a large pool of proxy servers if required: https://github.com/subes/invesdwin-webproxy – subes Jun 09 '17 at 18:57
  • @subes can your framework be used for web analytics testing? – vikramvi Nov 11 '17 at 10:53
  • My requirement is to do "Web Analytics" automation; can Jsoup do all the testing activities? – vikramvi Nov 11 '17 at 10:54
  • Well, you can automate users coming from various parts of the world by creating proxy-enabled bots (just web scrapers that navigate your analytics-enabled website). Be aware, though, that some analytics packages filter users coming from public proxies, so it is better to use a service for this or your own servers with proxies installed. Or you can navigate your analytics website itself and collect data for a dashboard there. So yes, but Jsoup might not be enough for this, since HtmlUnit provides cookies, JS, and other support that is essential here. – subes Nov 12 '17 at 11:17

Your best bet is to use Selenium WebDriver, since it:

  1. Provides visual feedback to the coder (you can see your scraping in action and see where it stops).

  2. Is accurate and consistent, as it directly controls the browser you use.

  3. Is slow. It doesn't hit web pages the way HtmlUnit does, but sometimes you don't want to hit them too fast.

    HtmlUnit is fast, but is horrible at handling JavaScript and AJAX.
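
A rough sketch of the WebDriver approach described above, assuming ChromeDriver is installed and on the PATH (the pageID URL scheme is a hypothetical stand-in):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumTitleScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // opens a visible Chrome window
        try {
            for (int pageId = 1; pageId <= 10; pageId++) {
                // Hypothetical pageID scheme; substitute the real site's URLs.
                driver.get("https://example.com/page?id=" + pageId);
                System.out.println(pageId + ": " + driver.getTitle());
            }
        } finally {
            driver.quit(); // always close the browser session
        }
    }
}
```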

KJW
  • Adding here that to boost performance in Selenium WebDriver, you can use a headless browser (Chrome, Firefox), as sketched below. – Adi Ohana Apr 29 '19 at 11:52
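
A minimal variation of the sketch above for the headless mode mentioned in this comment (the exact flag spelling varies across Chrome versions):

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessExample {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // "--headless=new" on recent Chrome versions
        WebDriver driver = new ChromeDriver(options); // no visible window is opened
        try {
            driver.get("https://example.com"); // placeholder URL
            System.out.println(driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}
```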

HtmlUnit can be used to do web scraping; it supports invoking pages and filling & submitting forms. I have used it in my project. It is a good Java library for web scraping. Read here for more.
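
A minimal sketch with HtmlUnit 2.x (where the package is still com.gargoylesoftware.htmlunit; the URL is a hypothetical placeholder):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitTitleScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(false); // skip JS for plain HTML pages
            client.getOptions().setCssEnabled(false);        // no CSS needed for scraping
            HtmlPage page = client.getPage("https://example.com/page?id=1"); // hypothetical URL
            System.out.println(page.getTitleText());
        }
    }
}
```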

Beschi

mechanize for Java would be a good fit for this, and as Wajdy Essam mentioned, it uses Jsoup for the HTML. mechanize is a stateful HTTP/HTML client that supports navigation, form submissions, and page scraping.

http://gistlabs.com/software/mechanize-for-java/ (and the GitHub here https://github.com/GistLabs/mechanize)

user1374041

There is also Jaunt, a Java library for web scraping and JSON querying - http://jaunt-api.com

Slavus

You might look into jwht-scraper!

This is a complete scraping framework that has all the features a developer could expect from a web scraper.

It works with the [jwht-htmltopojo](https://github.com/whimtrip/jwht-htmltopojo) lib, which itself uses Jsoup, mentioned by several other people here.

Together they will help you build awesome scrapers that map HTML directly to POJOs, bypassing the classical scraping problems in only a matter of minutes!

Hope this might help some people here!

Disclaimer: I am the one who developed it; feel free to let me know your remarks!

Louis-wht

Look at an HTML parser such as TagSoup, HTMLCleaner or NekoHTML.

Mikos

If you wish to automate scraping of a large number of pages or data points, then you could try Gotz ETL.

It is completely model-driven, like a real ETL tool. Data structures, task workflows, and the pages to scrape are defined with a set of XML definition files, and no coding is required. Queries can be written either as selectors with JSoup or as XPath with HtmlUnit.

Maithilish

For tasks of this type I usually use crawler4j plus Jsoup.

With crawler4j I download the pages from a domain; you can specify which URLs to visit with a regular expression.

With Jsoup I parse the HTML data that crawler4j has downloaded.

Normally you can also download data with Jsoup alone, but crawler4j makes it easier to find links. Another advantage of crawler4j is that it is multithreaded, and you can configure the number of concurrent threads. A sketch of the combination follows the link below.

https://github.com/yasserg/crawler4j/wiki
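
A rough sketch of the crawler4j + Jsoup combination (the seed URL, URL regex, and storage folder are hypothetical placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;

public class TitleCrawler extends WebCrawler {
    // Hypothetical URL pattern; only matching pages are crawled.
    private static final Pattern PAGE_URL = Pattern.compile("https://example\\.com/page\\?id=\\d+");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return PAGE_URL.matcher(url.getURL()).matches();
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Hand the raw HTML that crawler4j downloaded over to Jsoup.
            org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(html.getHtml());
            System.out.println(page.getWebURL().getURL() + " -> " + doc.title());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // intermediate crawl data lives here
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("https://example.com/page?id=1"); // hypothetical seed
        controller.start(TitleCrawler.class, 4); // 4 concurrent crawler threads
    }
}
```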

AleXxSnJR

Normally I use Selenium, which is software for test automation. You can control a browser through a WebDriver, so you will not have problems with JavaScript, and it is usually not detected much if you use the full version. Headless browsers are more easily identified.

incont67