Questions tagged [jsoup]

Jsoup is a Java HTML parser for extracting and manipulating HTML data, using the best of DOM, CSS, and jQuery-like methods.

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, Jsoup will create a sensible parse tree.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Selecting specific content

The select(...) method is used to select a subset of the Elements from a Document. It accepts a CSS selector that specifies which elements are selected and returned.

Some examples of usage, after loading or parsing an HTML document:

  • Elements links = doc.select("a[href]")

    This will select any a element with an href attribute, i.e. any link on the page.

  • Elements pngs = doc.select("img[src$=.png]")

    This will select any img element where the value of the src attribute ends in .png, so this will select any image which is a PNG image.

This method returns an Elements list which contains all the elements matched by the selector.
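
The two selectors above can be tried without a network connection by parsing an inline string. The following is a minimal sketch, not an official example: the class name SelectorSketch, the sample markup, and the example.com base URI are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><body>"
        + "<a href=\"/about\">About</a>"
        + "<a name=\"top\">Anchor without href</a>"
        + "<img src=\"logo.png\"><img src=\"photo.jpg\">"
        + "</body></html>";

    // Parses the sample markup and returns the elements matching the CSS query.
    static Elements select(String cssQuery) {
        // The base URI (an assumed example domain) lets jsoup resolve relative URLs.
        Document doc = Jsoup.parse(HTML, "http://example.com/");
        return doc.select(cssQuery);
    }

    public static void main(String[] args) {
        // Every a element that carries an href attribute.
        for (Element link : select("a[href]")) {
            System.out.println(link.text() + " -> " + link.absUrl("href"));
        }
        // Every img element whose src attribute ends in .png.
        for (Element png : select("img[src$=.png]")) {
            System.out.println(png.attr("src"));
        }
    }
}
```

Elements can be iterated like any Java list, and absUrl("href") resolves a relative link against the base URI passed to Jsoup.parse.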

There is an introduction on the Jsoup website, and the Javadoc page lists many more advanced possibilities, such as matching by regex, exclusions, pseudo-selectors, etc.

JavaScript support

Jsoup does not currently support JavaScript: it parses only the HTML that the server returns, so data that a page loads with JavaScript will not be available when parsing with Jsoup.

If you want to get such dynamically loaded data, you can:

  • Use an alternative, such as HtmlUnit, Selenium WebDriver or ui4j.

  • Use the website's API, if it offers one.

  • Find out where the website loads its data from. Usually all you need to do is send an HTTP request to that address to get the data as JSON.
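
As a rough sketch of the last option, jsoup's own Connection can fetch a non-HTML response body once its content-type check is switched off. The class name JsonEndpointSketch and the endpoint URL are hypothetical; the real address has to be found by inspecting the requests the page makes, and parsing the returned JSON needs a separate JSON library.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class JsonEndpointSketch {
    // Builds (but does not send) a request for a JSON endpoint.
    static Connection jsonRequest(String url) {
        return Jsoup.connect(url)
                .ignoreContentType(true)               // by default jsoup rejects non-HTML/XML content types
                .header("Accept", "application/json")
                .timeout(10_000);                      // timeout in milliseconds
    }

    // Sends the request and returns the raw response body as a string.
    static String fetchJson(String url) throws IOException {
        return jsonRequest(url).execute().body();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint, assumed for illustration only.
        System.out.println(fetchJson("https://example.com/api/items.json"));
    }
}
```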

Open source

Jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.

Jsoup implements the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification and parses HTML to the same DOM as modern browsers do.

Jsoup can be used to ...

  • Scrape and parse HTML from a URL, file, or string.
  • Find and extract data, using DOM traversal or CSS selectors.
  • Manipulate the HTML elements, attributes, and text.
  • Clean user-submitted content against a safe white-list, to prevent XSS attacks.
  • Output tidy HTML.
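
Two of the uses listed above, cleaning untrusted markup and manipulating attributes, might look like the sketch below. It assumes a recent jsoup release, where the class is named Safelist (older releases call it Whitelist); the class name CleanAndEditSketch and the sample inputs are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class CleanAndEditSketch {
    // Keeps only the simple text-formatting tags permitted by the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Sets rel="nofollow" on every link in an HTML fragment.
    static String addNofollow(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Hypothetical user-submitted input containing a script and an event handler.
        String dirty = "<p onclick=\"steal()\">Hello <b>world</b><script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
        System.out.println(addNofollow("<a href=\"http://example.com/\">link</a>"));
    }
}
```

Safelist offers other presets besides basic(), so the preset can be matched to how much markup users are allowed to submit.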


Official Website: http://jsoup.org/


6785 questions
114
votes
15 answers

How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code: public class NewClass { public String noTags(String str){ return Jsoup.parse(str).text(); } public static void main(String args[]) { String strings="
Billy
  • 1,141
  • 2
  • 8
  • 3
104
votes
6 answers

Jsoup SocketTimeoutException: Read timed out

I get a SocketTimeoutException when I try to parse a lot of HTML documents using Jsoup. For example, I have a list of links: link1 link2
C. Maillard
  • 1,041
  • 2
  • 7
  • 4
57
votes
10 answers

How to "scan" a website (or page) for info, and bring it into my program?

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java). For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the…
James
  • 5,622
  • 9
  • 34
  • 42
52
votes
6 answers

jsoup posting and cookie

I'm trying to use jsoup to log in to a site and then scrape information, but I am running into a problem: I can log in successfully and create a Document from index.php, but I cannot get other pages on the site. I know I need to set a cookie after I…
Gwindow
  • 645
  • 2
  • 9
  • 7
51
votes
3 answers

jsoup - strip all formatting and link tags, keep text only

Let's say I have an HTML fragment like this:

foo bar foobar baz

What I want to extract from that is: foo bar foobar baz So my question is: how can I strip all the wrapping tags from an HTML fragment and get only…
WonderCsabo
  • 11,947
  • 13
  • 63
  • 105
46
votes
8 answers

Page content is loaded with JavaScript and Jsoup doesn't see it

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup? Can't paste page code here, since…
Eugene
  • 4,352
  • 8
  • 55
  • 79
46
votes
4 answers

JSoup UserAgent, how to set it right?

I'm trying to parse the front page of Facebook with JSoup, but I always get the HTML code for mobile devices and not the version for normal browsers (in my case Firefox 5.0). I'm setting my User Agent like this: doc = Jsoup.connect(url) …
Markus
  • 521
  • 1
  • 5
  • 7
45
votes
7 answers

How to add proxy support to Jsoup?

I am a beginner to Java and my first task is to parse some 10,000 URLs and extract some info out of them. For this I am using Jsoup and it's working fine. But now I want to add proxy support to it. The proxies have a username and password too.
Himanshu
  • 1,433
  • 4
  • 24
  • 35
43
votes
2 answers

Connection error: "org.jsoup.UnsupportedMimeTypeException: Unhandled content type"

When I try to open a link to parse with jsoup I get an error. Connection command: Document doc = Jsoup.connect("http://www.rfi.ro/podcast/emisiune/174/feed.xml") .timeout(10 * 1000).get(); Errors thrown: Exception in thread "main"…
user2340897
  • 433
  • 1
  • 4
  • 4
39
votes
1 answer

How to parse data in Talend with Java (coming from a previously produced .txt file)?

I have a process in Talend which gets the search result of a page, saves the html and writes it into files, as seen here: Initially I had a two step process with parsing out the date from the HTML files in Java. Here is the code: It works and…
ZedBrannigan
  • 601
  • 1
  • 8
  • 18
39
votes
3 answers

How to parse HTML table using jsoup?

I am trying to parse HTML using jsoup. This is my first time working with jsoup and I have read some tutorials on it as well. Below is my HTML table which I am trying to parse - If you see my below table, it has three tr as of now (I have shortened it down…
john
  • 11,311
  • 40
  • 131
  • 251
38
votes
1 answer

How to parse XML with jsoup

I am trying to parse XML with jsoup, but I can't find any examples on this task. My XML document looks like this: xxx xxx
JavaCake
  • 4,075
  • 14
  • 62
  • 125
38
votes
4 answers

Jsoup: how to get an image's absolute url?

Is there a way in jsoup to extract an image absolute url, much like one can get a link's absolute url? Consider the following image element found in http://www.example.com/ I would like to…
r0u1i
  • 3,526
  • 6
  • 28
  • 36
36
votes
2 answers

(how) can I download an image using JSoup?

I already know where the image is, but for simplicity's sake I wanted to download the image using JSoup itself. (This is to simplify getting cookies, referrer, etc.) This is what I have so far: //Open a URL Stream Response resultImageResponse =…
user1499731
33
votes
3 answers

Does jsoup support xpath?

There's some work in progress related to adding xpath support to jsoup https://github.com/jhy/jsoup/pull/80. Is it working? How can I use it?
gguardin
  • 551
  • 1
  • 4
  • 9