
I need to parse many HTML web pages (100+) for specific content: a few lines of text that are almost the same on each page.

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both methods are slow, and with jsoup I get the following error (on multiple computers with different connections): java.net.SocketTimeoutException: Read timed out

Is there anything better?

EDIT:

Now that I've gotten jsoup to work, I think a better question is how do I speed it up?
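One common way to speed this kind of job up is to fetch the pages concurrently, since most of the time is spent waiting on the network rather than parsing. Below is a minimal sketch using a fixed thread pool from java.util.concurrent; the URLs are placeholders, and fetchAndExtract is a hypothetical stand-in for the real work (e.g. a jsoup connect-and-select step):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; substitute your 100+ real pages.
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");

        // Pool size is a tuning knob: too many threads can saturate your connection.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();
        for (String url : urls) {
            results.add(pool.submit(() -> fetchAndExtract(url)));
        }
        for (Future<String> f : results) {
            System.out.println(f.get()); // blocks until that page's task finishes
        }
        pool.shutdown();
    }

    // Hypothetical stand-in for the real fetch-and-parse step,
    // e.g. Jsoup.connect(url).get() followed by a selector.
    static String fetchAndExtract(String url) {
        return "extracted text from " + url;
    }
}
```

With sequential fetching, 100 pages at ~1 second each takes ~100 seconds; with 8 workers the wall-clock time drops roughly by that factor, network permitting.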

samwise
    Jsoup supports both DOM traversal and [CSS] selectors, no? (Why use regular expressions? :-/) –  Jul 14 '11 at 03:13

3 Answers


Did you try lengthening the timeout on JSoup? It's only 3 seconds by default, I believe. See e.g. this.
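For reference, a sketch of raising the timeout with jsoup's connect API (the URL is a placeholder; 3 seconds was the default at the time, so a slow page can easily trip it):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/") // placeholder URL
                .timeout(10_000)  // milliseconds; 0 means wait indefinitely
                .get();
        System.out.println(doc.title());
    }
}
```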

Ed Staub

I suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library; it uses Lucene under the hood, and I find it to be a very reliable crawler.

MD Sayem Ahmed
  • Jericho is a good alternative too. I've used Nutch and Jericho, but have no experience with JSoup so can't comment on why it would be taking so long. – jkraybill Jul 14 '11 at 04:49

A great skill to learn would be XPath. It would be perfect for this job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.

Here's a nice link since you are interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also a good thing to know outside of Java, which is why I would choose that route.
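A minimal sketch of XPath in Java using the standard javax.xml.xpath API. Note the caveat raised in the comments below: this API needs well-formed XML, so real-world HTML usually has to be cleaned first (e.g. with TagSoup); the inline markup here is a well-formed toy example:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Toy well-formed markup; real HTML would need cleaning into XML first.
        String xhtml = "<html><body><p class='target'>hello</p><p>other</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(...) returns the string value of the first matching node.
        String text = xpath.evaluate("//p[@class='target']", doc);
        System.out.println(text); // hello
    }
}
```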

JustBeingHelpful
  • Except ... HTML is not XML. I suspect this post *wouldn't* have received a down-vote (not mine) if a link to a library that exposed HTML via XPath was also included. (Such tools, which are capable of treating HTML "as" an XML DOM, are definitely worth talking about.) –  Jul 14 '11 at 02:58
  • XPath is for XML, and won't work on any HTML that isn't XML compatible. – Ed Staub Jul 14 '11 at 02:58
  • it's used for both HTML and XML. http://tech-read.com/2011/03/09/extract-html-content-using-xpath/ – JustBeingHelpful Jul 14 '11 at 03:02
  • @Mr. Wanta Yes, so what *Java* library parses *HTML* (not just *XML*) and exposes XPath over it? :) This answer isn't bad, but it is missing some important pieces of the puzzle. (Note that [jsoup](http://jsoup.org/), which the question is tagged, supports CSS selectors, *but not* XPath -- it looks like [this feature is requested](http://nextsprocket.com/tasks/4ce4ad9d9793fd31a8000002)) –  Jul 14 '11 at 03:09
  • Here's an example of use XOM and TagSoup to find elements in HTML - http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser/774519#774519 – laz Jul 14 '11 at 03:19
  • @Mr. Wanta - [Regular expressions are also used](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) to parse HTML, but that doesn't make it a good idea. XPath will work on some pages. It won't on others. If you don't have complete control over your input set, it's a bad choice. – Ed Staub Jul 14 '11 at 03:28
  • @Mr. Wanta Yes! :-) Now update the post with the references to those libraries, in conjunction with the assertion that using XPath is a good (or at least better) way to handle this. I have given a +1 for these anticipated future changes. –  Jul 14 '11 at 03:31
  • Thanks for the clarifications. I learned something today. :-) That's why I love Stack Overflow! To feel more humble and learn more! – JustBeingHelpful Jul 14 '11 at 03:52