
I need to parse many HTML web pages (100+) for specific content: a few lines of text that are almost the same on each page.

I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.

Both methods are slow, and with jsoup I get the following error (on multiple computers with different connections): java.net.SocketTimeoutException: Read timed out

Is there anything better?

EDIT:

Now that I've gotten jsoup to work, I think a better question is how do I speed it up?
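One common way to speed this kind of job up is to fetch the pages concurrently, since most of the time is spent waiting on the network rather than parsing. Below is a minimal sketch using a fixed thread pool from java.util.concurrent; the URLs are placeholders, and fetchAndExtract is a hypothetical stand-in for the real work (e.g. a jsoup connect-and-select step):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; substitute your 100+ real pages.
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");

        // Pool size is a tuning knob: too many threads can saturate your connection.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();
        for (String url : urls) {
            results.add(pool.submit(() -> fetchAndExtract(url)));
        }
        for (Future<String> f : results) {
            System.out.println(f.get()); // blocks until that page's task finishes
        }
        pool.shutdown();
    }

    // Hypothetical stand-in for the real fetch-and-parse step,
    // e.g. Jsoup.connect(url).get() followed by a selector.
    static String fetchAndExtract(String url) {
        return "extracted text from " + url;
    }
}
```

With sequential fetching, 100 pages at ~1 second each takes ~100 seconds; with 8 workers the wall-clock time drops roughly by that factor, network permitting.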

samwise
    Jsoup supports both DOM traversal and [CSS] selectors, no? (Why use regular expressions? :-/) –  Jul 14 '11 at 03:13

3 Answers


Did you try lengthening the timeout on JSoup? It's only 3 seconds by default, I believe. See e.g. this.
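For reference, a sketch of raising the timeout with jsoup's connect API (the URL is a placeholder; 3 seconds was the default at the time, so a slow page can easily trip it):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/") // placeholder URL
                .timeout(10_000)  // milliseconds; 0 means wait indefinitely
                .get();
        System.out.println(doc.title());
    }
}
```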

Ed Staub

I suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library; it uses Lucene under the hood, and I find it to be a very reliable crawler.

MD Sayem Ahmed
  • Jericho is a good alternative too. I've used Nutch and Jericho, but have no experience with JSoup so can't comment on why it would be taking so long. – jkraybill Jul 14 '11 at 04:49

A great skill to learn would be XPath. It would be perfect for this job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.

Here's a nice link since you are interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

XPath is also a good thing to know outside of Java, which is why I would choose that route.
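A minimal sketch of XPath in Java using the standard javax.xml.xpath API. Note the caveat raised in the comments below: this API needs well-formed XML, so real-world HTML usually has to be cleaned first (e.g. with TagSoup); the inline markup here is a well-formed toy example:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Toy well-formed markup; real HTML would need cleaning into XML first.
        String xhtml = "<html><body><p class='target'>hello</p><p>other</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(...) returns the string value of the first matching node.
        String text = xpath.evaluate("//p[@class='target']", doc);
        System.out.println(text); // hello
    }
}
```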

JustBeingHelpful
  • Except ... HTML is not XML. I suspect this post *wouldn't* have received a down-vote (not mine) if a link to a library that exposed HTML via XPath was also included. (Such tools, which are capable of treating HTML "as" an XML DOM, are definitely worth talking about.) –  Jul 14 '11 at 02:58
  • XPath is for XML, and won't work on any HTML that isn't XML compatible. – Ed Staub Jul 14 '11 at 02:58
  • it's used for both HTML and XML. http://tech-read.com/2011/03/09/extract-html-content-using-xpath/ – JustBeingHelpful Jul 14 '11 at 03:02
  • @Mr. Wanta Yes, so what *Java* library parses *HTML* (not just *XML*) and exposes XPath over it? :) This answer isn't bad, but it is missing some important pieces of the puzzle. (Note that [jsoup](http://jsoup.org/), which the question is tagged, supports CSS selectors, *but not* XPath -- it looks like [this feature is requested](http://nextsprocket.com/tasks/4ce4ad9d9793fd31a8000002)) –  Jul 14 '11 at 03:09
  • Here's an example of use XOM and TagSoup to find elements in HTML - http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser/774519#774519 – laz Jul 14 '11 at 03:19
  • @Mr. Wanta - [Regular expressions are also used](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) to parse HTML, but that doesn't make it a good idea. XPath will work on some pages. It won't on others. If you don't have complete control over your input set, it's a bad choice. – Ed Staub Jul 14 '11 at 03:28
  • @Mr. Wanta Yes! :-) Now update the post with the references to those libraries, in conjunction with the assertion that using XPath is a good (or at least better) way to handle this. I have given a +1 for these anticipated future changes. –  Jul 14 '11 at 03:31
  • Thanks for the clarifications. I learned something today. :-) That's why I love Stack Overflow! To feel more humble and learn more! – JustBeingHelpful Jul 14 '11 at 03:52