Jsoup is a Java HTML parser for extracting and manipulating HTML data, using the best of DOM, CSS, and jQuery-like methods.
Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, Jsoup will create a sensible parse tree.
Example
Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Selecting specific content
The select(...) method is used to select a subset of the Elements from a Document. This method accepts a CSS selector that specifies which elements are selected and returned.
Some examples of usage, after loading or parsing an HTML document:
Elements links = doc.select("a[href]")
This will select any a element with an href attribute, i.e. any link on the page.
Elements pngs = doc.select("img[src$=.png]")
This will select any img element where the value of the src attribute ends in .png, i.e. any image which is a PNG image.
This method returns an Elements list, which contains all the elements matched by the selector.
There is an introduction on the Jsoup website, and the Javadoc page lists many more advanced possibilities, such as matching by regex, exclusions, pseudo-selectors, etc.
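The two selectors above can be tried without a network connection by parsing a string. The following is a minimal sketch, not an official example: the class name, the sample markup, and the http://example.com/ base URI are assumptions made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectExample {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><body>"
        + "<a href=\"/about\">About</a>"
        + "<a name=\"top\">Anchor without href</a>"
        + "<img src=\"logo.png\">"
        + "<img src=\"photo.jpg\">"
        + "</body></html>";

    // Every <a> element that carries an href attribute.
    static Elements links(Document doc) {
        return doc.select("a[href]");
    }

    // Every <img> element whose src attribute ends in ".png".
    static Elements pngs(Document doc) {
        return doc.select("img[src$=.png]");
    }

    public static void main(String[] args) {
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(HTML, "http://example.com/");

        for (Element link : links(doc)) {
            // absUrl resolves the relative href against the base URI.
            System.out.println(link.text() + " -> " + link.absUrl("href"));
        }
        for (Element img : pngs(doc)) {
            System.out.println(img.attr("src"));
        }
    }
}
```

With this sample markup, the anchor without an href and the JPEG image are not matched, so each selector returns a single element.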
JavaScript support
Jsoup does not currently support JavaScript, which means that data a page loads with JavaScript will not be available when parsing with Jsoup.
If you want to get such dynamically loaded data, you can:
Use an alternative, such as HtmlUnit, Selenium WebDriver or ui4j.
Use the website's API, if it offers one.
Find out where the website loads its data from; usually all you need to do is send an HTTP request to that location to get the data as JSON.
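If the data turns out to come from a JSON endpoint, Jsoup's own HTTP client can fetch the raw body once it is told not to reject non-HTML content types. This is a rough sketch under assumptions: the endpoint URL, class name, and method names are hypothetical, and a real site may also require cookies, headers, or authentication.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class JsonEndpointExample {
    // Hypothetical endpoint; replace with the URL the page actually requests.
    static final String DATA_URL = "https://example.com/api/items.json";

    // Builds the request without sending it.
    static Connection jsonRequest(String url) {
        return Jsoup.connect(url)
            // By default Jsoup refuses responses that are not HTML or XML.
            .ignoreContentType(true)
            .header("Accept", "application/json")
            .timeout(10_000); // milliseconds
    }

    // Sends the request and returns the response body as a string.
    static String fetchJson(String url) throws IOException {
        return jsonRequest(url).execute().body();
    }

    public static void main(String[] args) throws IOException {
        // The returned text can then be handed to any JSON library.
        System.out.println(fetchJson(DATA_URL));
    }
}
```

Jsoup does not parse JSON itself, so the string returned here still needs a separate JSON parser.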
Open source
Jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.
Jsoup implements the Web Hypertext Application Technology Working Group (WHATWG) HTML5 specification and parses HTML to the same DOM as modern browsers do.
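To illustrate what browser-like parsing means in practice, the sketch below feeds Jsoup a fragment with no html, head, or body tags and with unclosed paragraphs. The class name and input string are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TagSoupExample {
    // Hypothetical input: no <html>, <head>, or <body>, and both <p> tags are unclosed.
    static final String SOUP = "<title>Soup</title><p>One<p>Two";

    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = parse(SOUP);

        // The title ends up inside a generated <head>.
        System.out.println(doc.title());            // prints "Soup"
        // Each <p> start tag implicitly closes the previous paragraph.
        System.out.println(doc.select("p").size()); // prints 2
        // Print the normalized document, including the generated structure.
        System.out.println(doc.html());
    }
}
```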
Jsoup can be used to:
- Scrape and parse HTML from a URL, file, or string.
- Find and extract data, using DOM traversal or CSS selectors.
- Manipulate the HTML elements, attributes, and text.
- Clean user-submitted content against a safe white-list, to prevent XSS attacks.
- Output tidy HTML.
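Two of the uses listed above, cleaning untrusted input and manipulating attributes, can be sketched as follows. This assumes a recent Jsoup release in which the allow-list class is named Safelist (older releases call it Whitelist); the class and method names of the example itself are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class CleanExample {
    // Keeps only the tags and attributes permitted by the basic safelist,
    // dropping script elements and event-handler attributes such as onclick.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    // Sets target="_blank" on every link in an HTML fragment.
    static String openLinksInNewTab(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("a[href]").attr("target", "_blank");
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Hypothetical user-submitted input.
        String untrusted =
            "<p><a href='http://example.com/' onclick='steal()'>Link</a>"
            + "<script>alert(1)</script></p>";

        System.out.println(sanitize(untrusted));
        System.out.println(openLinksInNewTab("<a href=\"/docs\">Docs</a>"));
    }
}
```

Safelist.basic() is only one of several predefined lists; which one is appropriate depends on how much markup the application wants to allow.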
Official Website: http://jsoup.org/
Useful Links: