
Hi all, I'm writing a simple web crawling script that needs to connect to a web page, follow 302 redirects automatically, give me the final URL of the link, and let me grab the HTML.

What's the preferred Java library for doing these kinds of things?

thanks

James
  • Take a look - http://stackoverflow.com/questions/1322335/what-is-the-best-java-library-to-use-for-http-post-get-etc – KV Prajapati Jul 02 '10 at 03:31

2 Answers


You can use Apache HttpComponents Client for this (or the "plain vanilla" but verbose URLConnection API built into Java SE). For the HTML parsing/traversing/manipulation part, Jsoup may be useful.
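With the builtin URLConnection API alone, this can be done in a few lines: `HttpURLConnection` follows 301/302 redirects by default (though not across a protocol change like http to https), and after connecting, `getURL()` reports the URL you actually ended up at. A minimal sketch (the class and method names here are just illustrative):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageFetcher {

    // Fetches a page, following redirects automatically, and returns the HTML.
    // HttpURLConnection follows same-protocol 301/302 redirects by default.
    static String fetch(String address) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(true); // the default; shown for clarity
        conn.connect();

        // After connecting, getURL() reflects the final URL post-redirect.
        String finalUrl = conn.getURL().toString();
        System.out.println("Final URL: " + finalUrl);

        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Note that if the redirect crosses protocols, `HttpURLConnection` stops and you have to read the `Location` header and re-request yourself; Apache HttpComponents Client handles that case for you.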

Note that any decent crawler should obey the robots.txt. You may want to take a look at existing Java-based web crawlers, like J-Spider or Apache Nutch.
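To give an idea of what obeying robots.txt involves, here is a deliberately simplified sketch that checks a path against the `Disallow` rules in the `User-agent: *` group. A real crawler should use a proper parser (Nutch ships one); this hypothetical helper ignores per-bot groups, `Allow` rules, and wildcards:

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    // Returns true if the given path is disallowed for all user agents
    // by the supplied robots.txt content. Simplified: only honors the
    // "User-agent: *" group and plain prefix Disallow rules.
    static boolean isDisallowed(String robotsTxt, String path) {
        boolean inStarGroup = false;
        List<String> disallowed = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            String t = line.trim().toLowerCase();
            if (t.startsWith("user-agent:")) {
                inStarGroup = t.substring("user-agent:".length()).trim().equals("*");
            } else if (inStarGroup && t.startsWith("disallow:")) {
                String rule = t.substring("disallow:".length()).trim();
                if (!rule.isEmpty()) {
                    disallowed.add(rule); // empty Disallow means "allow everything"
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return true;
            }
        }
        return false;
    }
}
```

So `isDisallowed("User-agent: *\nDisallow: /private/", "/private/page.html")` is true, while `/index.html` would be allowed.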

BalusC

As BalusC said, have a look at Apache's HttpComponents Client. The Nutch project has solved lots of hard crawling/fetching/indexing problems, so if you want to see how they handle following a 302, have a look at http://svn.apache.org/viewvc/nutch/trunk/src/
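For what it's worth, on newer JDKs (11+) the standard library's `java.net.http.HttpClient` also handles this without any third-party dependency; whether you can use it depends on your Java version. A hedged sketch:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RedirectFetch {

    // Fetches a URL, following 301/302/303/307/308 redirects, and returns
    // the response; response.uri() is the final URL, response.body() the HTML.
    static HttpResponse<String> get(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

`Redirect.NORMAL` follows redirects except from https to http, which is usually what a crawler wants.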

labratmatt