I want to use HTTP GET and POST commands to retrieve URLs from a website and parse the HTML. How do I do this?
18
-
I have used [JTidy](http://jtidy.sourceforge.net/) in a project and it worked quite well. A list of other parsers is [here](http://java-source.net/open-source/html-parsers), but apart from JTidy I don't know any of them. – Markus Dec 11 '08 at 17:55
-
Use http://hc.apache.org/httpclient-3.x/ – Nick Holt Dec 11 '08 at 14:05
2 Answers
21
You can use HttpURLConnection in combination with URL.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

URL url = new URL("http://example.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET"); // GET is the default; set here for clarity
connection.connect();
InputStream stream = connection.getInputStream();
// read the contents using an InputStreamReader
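A minimal sketch of how the snippet above might be completed, covering both GET and POST with nothing but the JDK. The class name `HttpSketch`, its helper methods, and the assumption that the response is UTF-8 text are illustrative choices, not part of the original answer. The POST variant assumes the server accepts a URL-encoded form body.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class HttpSketch {

    // Reads the whole stream into a String, assuming UTF-8 content.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Builds an application/x-www-form-urlencoded body such as "a=1&b=2".
    static String encodeForm(Map<String, String> form) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : form.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    // Performs a GET and returns the response body.
    static String get(String address) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestMethod("GET");
        try {
            return readAll(connection.getInputStream());
        } finally {
            connection.disconnect();
        }
    }

    // Performs a POST with a form-encoded body and returns the response body.
    static String post(String address, Map<String, String> form) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true); // required before writing a request body
        connection.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        byte[] body = encodeForm(form).getBytes(StandardCharsets.UTF_8);
        try {
            try (OutputStream out = connection.getOutputStream()) {
                out.write(body);
            }
            return readAll(connection.getInputStream());
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Needs network access; prints the HTML of the page.
        System.out.println(get("http://example.com"));
    }
}
```

Note that `getInputStream()` throws an `IOException` for 4xx and 5xx responses, so real code would probably check `getResponseCode()` first and fall back to `getErrorStream()`.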

Rob Hruska
Create a BufferedReader from the InputStream to read the content into a String variable. – rockit May 24 '10 at 15:07
-
Thank you. This shows the most basic way to do it, and it makes clear what a simple URL connection requires. However, the longer-term strategy would be to use [HTTP Client](http://hc.apache.org/httpcomponents-client/index.html "HTTP Client"), which offers more advanced and feature-rich ways to complete this task. – Johnny Maelstrom Dec 12 '08 at 09:55
3
The easiest way to do a GET is to use the built-in java.net.URL. However, as mentioned, HttpClient is the proper way to go, as it lets you handle redirects, among other things.
For parsing the HTML, you can use the HTML Parser library.
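To give a rough idea of what the parsing step can look like, here is a sketch that pulls the `href` of every `<a>` tag out of a page. It does not use the HTML Parser library mentioned above; it swaps in the parser that ships with the JDK (`javax.swing.text.html`), simply so the example runs without extra dependencies. The class name `LinkExtractorSketch` and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
class LinkExtractorSketch {

    // Returns the href values of all <a> tags, in document order.
    static List<String> extractLinks(Reader html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // 'true' tells the parser to ignore any charset declared in the document,
        // since the Reader has already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup invented for this example.
        String page = "<html><body><a href=\"/a\">A</a>"
                + "<p><a href=\"http://example.com/b\">B</a></p></body></html>";
        for (String link : extractLinks(new StringReader(page))) {
            System.out.println(link);
        }
    }
}
```

The JDK parser only understands fairly old HTML, so for real-world pages a dedicated library such as the ones mentioned in this thread is likely the better choice.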

kgiannakakis