Questions tagged [jsoup]

Jsoup is a Java HTML parser for extracting and manipulating HTML data, using the best of DOM, CSS, and jQuery-like methods.

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, Jsoup will create a sensible parse tree.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Selecting specific content

The select(...) method is used to select a subset of the Elements from a Document. This method accepts a CSS selector that specifies which elements are selected and returned.

Some examples of usage, after loading or parsing an HTML document:

Elements links = doc.select("a[href]");

This selects every a element with an href attribute, i.e. every link on the page.

Elements pngs = doc.select("img[src$=.png]");

This selects every img element whose src attribute value ends in .png, which in practice means every PNG image.

This method returns an Elements list which contains all the elements matched by the selector.

There is an introduction on the Jsoup website, and the Javadoc page lists many more advanced possibilities, such as matching by regex, exclusions, pseudo-selectors, etc.
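The two selectors above can be tried offline by parsing a string instead of fetching a page. The following is a minimal sketch, not an official example: the class name SelectExample, the helper attrValues, and the sample markup are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectExample {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><body>"
        + "<a href=\"/about\">About</a>"
        + "<a name=\"top\">Anchor without href</a>"
        + "<img src=\"logo.png\">"
        + "<img src=\"photo.jpg\">"
        + "</body></html>";

    // Parses the HTML, runs the CSS selector, and collects one attribute
    // value from every matched element.
    static List<String> attrValues(String html, String selector, String attr) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(selector);
        List<String> values = new ArrayList<>();
        for (Element e : matches) {
            values.add(e.attr(attr));
        }
        return values;
    }

    public static void main(String[] args) {
        // Only the first anchor has an href attribute.
        System.out.println(attrValues(HTML, "a[href]", "href"));        // prints [/about]
        // Only the first image has a src ending in .png.
        System.out.println(attrValues(HTML, "img[src$=.png]", "src"));  // prints [logo.png]
    }
}
```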

JavaScript support

Jsoup does not currently support JavaScript, which means that any data a page loads or generates with JavaScript will not be present in the document that Jsoup parses.

If you want to get such dynamically loaded data, you can:

Use an alternative, such as HtmlUnit, Selenium WebDriver or ui4j.
Use the website's API, if it offers one.
Find out where the website loads its data from; usually all you need to do is send an HTTP request to that address to get the data as JSON.
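For the last option, Jsoup's own HTTP client can fetch the JSON, provided it is told not to reject non-HTML content types. The sketch below assumes a made-up endpoint (DATA_URL) and made-up names (JsonEndpointExample, jsonRequest, fetchJson); the real address has to be found by inspecting the site's network traffic, and parsing the returned JSON is left to a JSON library of your choice.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class JsonEndpointExample {
    // Hypothetical endpoint; replace it with the address the page actually calls.
    static final String DATA_URL = "https://example.com/api/items.json";

    // Builds the request without sending it.
    static Connection jsonRequest(String url) {
        return Jsoup.connect(url)
                .ignoreContentType(true)               // accept application/json instead of failing on it
                .header("Accept", "application/json")
                .timeout(10_000);                      // timeout in milliseconds
    }

    // Sends the request and returns the raw response body as a string.
    static String fetchJson(String url) throws IOException {
        return jsonRequest(url).execute().body();
    }

    public static void main(String[] args) throws IOException {
        // Pass a real URL as the first argument to perform the request.
        String url = args.length > 0 ? args[0] : DATA_URL;
        System.out.println(fetchJson(url));
    }
}
```

Without ignoreContentType(true), Jsoup refuses responses whose content type it does not handle and throws UnsupportedMimeTypeException.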

Open source

Jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.

Jsoup implements the Web Hypertext Application Technology Working Group (WHATWG) HTML5 specification and parses HTML to the same DOM as modern browsers do.

Jsoup can be used to ...

Scrape and parse HTML from a URL, file, or string.
Find and extract data, using DOM traversal or CSS selectors.
Manipulate the HTML elements, attributes, and text.
Clean user-submitted content against a safe white-list, to prevent XSS attacks.
Output tidy HTML.
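As an illustration of the cleaning use case in the list above, here is a minimal sketch. It assumes a recent Jsoup release, where the allow-list class is org.jsoup.safety.Safelist; older releases name the same class Whitelist. The class name CleanExample, the helper sanitize, and the input string are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {
    // Keeps only the tags and attributes permitted by the built-in "basic"
    // list (simple text formatting and links) and drops everything else.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user input containing a script tag and an event handler.
        String dirty = "<p>Hi <script>alert(1)</script>"
            + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";
        System.out.println(sanitize(dirty));
    }
}
```

Safelist also offers other presets, such as none() and relaxed(), and a list can be extended with addTags(...) and addAttributes(...) when the defaults are too strict.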

Official Website: http://jsoup.org/

6785 questions

114

votes

15 answers

How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code: public class NewClass { public String noTags(String str){ return Jsoup.parse(str).text(); } public static void main(String args[]) { String strings="

java jsoup

asked Apr 12 '11 at 19:11

Billy

1,141
2
8
3

104

votes

6 answers

Jsoup SocketTimeoutException: Read timed out

I get a SocketTimeoutException when I try to parse a lot of HTML documents using Jsoup. For example, I got a list of links : link1 link2

java jsoup

asked Jul 04 '11 at 12:32

C. Maillard

1,041
2
7
4

votes

10 answers

How to "scan" a website (or page) for info, and bring it into my program?

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java). For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the…

java html web-scraping jsoup

asked May 14 '10 at 15:48

James

5,622
9
34
42

votes

6 answers

jsoup posting and cookie

I'm trying to use jsoup to login to a site and then scrape information, I am running into in a problem, I can login successfully and create a Document from index.php but I cannot get other pages on the site. I know I need to set a cookie after I…

java screen-scraping jsoup

asked Jun 21 '11 at 22:56

Gwindow

votes

3 answers

jsoup - strip all formatting and link tags, keep text only

Let's say i have a html fragment like this:

foo bar foobar baz

What i want to extract from that is: foo bar foobar baz So my question is: how can i strip all the wrapping tags from a html and get only…

java html jsoup

asked Oct 17 '12 at 21:31

WonderCsabo

11,947
13
63
105

votes

8 answers

Page content is loaded with JavaScript and Jsoup doesn't see it

One block on the page is filled with content by JavaScript and after loading page with Jsoup there is none of that information. Is there a way to get also JavaScript generated content when parsing page with Jsoup? Can't paste page code here, since…

java html web-scraping jsoup

asked Sep 20 '11 at 17:01

Eugene

4,352
8
55
79

votes

4 answers

JSoup UserAgent, how to set it right?

I'm trying to parse the frontpage of facebook with JSoup but I always get the HTML Code for mobile devices and not the version for normal browsers(In my case Firefox 5.0). I'm setting my User Agent like this: doc = Jsoup.connect(url) …

jsoup

asked Jul 05 '11 at 11:06

Markus

votes

7 answers

How to add proxy support to Jsoup?

I am a beginner to Java and my first task is to parse some 10,000 URLs and extract some info out of it, for this I am using Jsoup and it's working fine. But now I want to add proxy support to it. The proxies have a username and password too.

java jsoup

asked Sep 20 '11 at 09:11

Himanshu

1,433
4
24
35

votes

2 answers

Connection error: "org.jsoup.UnsupportedMimeTypeException: Unhandled content type"

When I try to open a link to parse with jsoup I get an error. Connection command: Document doc = Jsoup.connect("http://www.rfi.ro/podcast/emisiune/174/feed.xml") .timeout(10 * 1000).get(); Errors thrown: Exception in thread "main"…

java jsoup httpconnection

asked May 01 '13 at 21:50

user2340897

votes

1 answer

How to parse data in Talend with Java (coming from a previously produced .txt file)?

I have a process in Talend which gets the search result of a page, saves the html and writes it into files, as seen here: Initially I had a two step process with parsing out the date from the HTML files in Java. Here is the code: It works and…

java parsing jsoup talend

asked Jul 24 '14 at 13:01

ZedBrannigan

votes

3 answers

How to parse HTML table using jsoup?

I am trying to parse HTML using jsoup. This is my first time working with jsoup and I read some tutorial on it as well. Below is my HTML table which I am trying to parse - If you see my below table, it has three tr as of now (I have shorten it down…

java html parsing jsoup

asked Jul 16 '14 at 05:29

john

11,311
40
131
251

votes

1 answer

How to parse XML with jsoup

I am trying to parse XML with jsoup, but I can't find any examples on this task. My XML document looks like this: xxx xxx …

java xml-parsing jsoup

asked Mar 27 '12 at 09:16

JavaCake

4,075
14
62
125

votes

4 answers

Jsoup: how to get an image's absolute url?

Is there a way in jsoup to extract an image absolute url, much like one can get a link's absolute url? Consider the following image element found in http://www.example.com/

I would like to…

jsoup

asked Feb 02 '11 at 13:35

r0u1i

3,526
6
28
36

votes

2 answers

(how) can I download an image using JSoup?

I already know where the image is, but for simplicity's sake I wanted to download the image using JSoup itself. (This is to simplify getting cookies, referrer, etc.) This is what I have so far: //Open a URL Stream Response resultImageResponse =…

java jsoup

asked Sep 17 '12 at 19:04

user1499731

votes

3 answers

Does jsoup support xpath?

There's some work in progress related to adding xpath support to jsoup https://github.com/jhy/jsoup/pull/80. Is it working? How can I use it?

xpath jsoup

asked Aug 16 '11 at 21:54

gguardin
