
Hi all, I'm writing a simple web crawling script that needs to connect to a web page, follow 302 redirects automatically, give me the final URL of the link, and let me grab the HTML.

What's the preferred Java library for doing these kinds of things?

thanks

James
  • Take a look - http://stackoverflow.com/questions/1322335/what-is-the-best-java-library-to-use-for-http-post-get-etc – KV Prajapati Jul 02 '10 at 03:31

2 Answers


You can use Apache HttpComponents Client for this (or the "plain vanilla" but verbose URLConnection API built into Java SE). For the HTML parsing/traversing/manipulation part, Jsoup may be useful.
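With the builtin URLConnection API alone, this can be done in a few lines: `HttpURLConnection` follows 301/302 redirects by default (though not across a protocol change like http to https), and after connecting, `getURL()` reports the URL you actually ended up at. A minimal sketch (the class and method names here are just illustrative):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageFetcher {

    // Fetches a page, following redirects automatically, and returns the HTML.
    // HttpURLConnection follows same-protocol 301/302 redirects by default.
    static String fetch(String address) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(true); // the default; shown for clarity
        conn.connect();

        // After connecting, getURL() reflects the final URL post-redirect.
        String finalUrl = conn.getURL().toString();
        System.out.println("Final URL: " + finalUrl);

        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Note that if the redirect crosses protocols, `HttpURLConnection` stops and you have to read the `Location` header and re-request yourself; Apache HttpComponents Client handles that case for you.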

Note that any decent crawler should obey the robots.txt. You may want to take a look at existing Java-based web crawlers, like J-Spider or Apache Nutch.
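To give an idea of what obeying robots.txt involves, here is a deliberately simplified sketch that checks a path against the `Disallow` rules in the `User-agent: *` group. A real crawler should use a proper parser (Nutch ships one); this hypothetical helper ignores per-bot groups, `Allow` rules, and wildcards:

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    // Returns true if the given path is disallowed for all user agents
    // by the supplied robots.txt content. Simplified: only honors the
    // "User-agent: *" group and plain prefix Disallow rules.
    static boolean isDisallowed(String robotsTxt, String path) {
        boolean inStarGroup = false;
        List<String> disallowed = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            String t = line.trim().toLowerCase();
            if (t.startsWith("user-agent:")) {
                inStarGroup = t.substring("user-agent:".length()).trim().equals("*");
            } else if (inStarGroup && t.startsWith("disallow:")) {
                String rule = t.substring("disallow:".length()).trim();
                if (!rule.isEmpty()) {
                    disallowed.add(rule); // empty Disallow means "allow everything"
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return true;
            }
        }
        return false;
    }
}
```

So `isDisallowed("User-agent: *\nDisallow: /private/", "/private/page.html")` is true, while `/index.html` would be allowed.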

BalusC

As BalusC said, have a look at Apache's HttpComponents Client. The Nutch project has solved lots of hard crawling/fetching/indexing problems, so if you want to see how they handle following a 302, have a look at http://svn.apache.org/viewvc/nutch/trunk/src/
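For what it's worth, on newer JDKs (11+) the standard library's `java.net.http.HttpClient` also handles this without any third-party dependency; whether you can use it depends on your Java version. A hedged sketch:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RedirectFetch {

    // Fetches a URL, following 301/302/303/307/308 redirects, and returns
    // the response; response.uri() is the final URL, response.body() the HTML.
    static HttpResponse<String> get(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

`Redirect.NORMAL` follows redirects except from https to http, which is usually what a crawler wants.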

labratmatt