I'm looking for a good Java API for web scraping. I tried the Web-Harvest API (http://web-harvest.sourceforge.net/usage.php), but I find it a bit clunky. Any other suggestions?
- "Any other suggestions?" Just one. Note that when searching for info on this topic, the word is 'scraping' (one 'p'), not 'scrapping' (which is a separate word that means 'fighting' or 'dumping'). – Andrew Thompson Mar 09 '11 at 18:53
- Possible duplicate of [How to "scan" a website (or page) for info, and bring it into my program?](http://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program). See also this [recent question](http://stackoverflow.com/questions/5240981/how-to-easily-parse-html-for-consumption-as-a-service-using-java) for another example. Note that you're basically asking "What is the best HTML parser in Java?". – BalusC Mar 09 '11 at 18:58
- You can follow [Web scraping with Java](http://stackoverflow.com/questions/3202305/web-scraping-with-java). – Sumit Ramteke Sep 15 '14 at 13:22
- Comparing libraries is generally off-topic here. See the [Software Recommendations Stack Exchange](https://softwarerecs.stackexchange.com/) instead. – Basil Bourque Jun 09 '17 at 20:14
3 Answers
I use this: https://github.com/subes/invesdwin-webproxy
It supports HttpClient and HtmlUnit (a headless browser that supports JavaScript) and, if required, parallelizes requests over a large pool of proxies. I can also recommend JSoup for processing static HTML.
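
To make "static HTML processing" concrete, here is a minimal sketch that pulls every link out of a page that has already been downloaded. JSoup would need to be on the classpath, so this dependency-free version uses the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`) as a stand-in; it is not JSoup's API and is far less forgiving of modern markup. The class name `LinkExtractor`, the method `extractLinks`, and the sample markup are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects link targets from static HTML using only the JDK.
final class LinkExtractor {

    // Returns the href value of every <a> start tag, in document order.
    static List<String> extractLinks(String html) throws IOException {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore any charset declared in the markup.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample markup, standing in for a downloaded page.
        String html = "<html><body>"
                + "<a href=\"/docs\">Docs</a>"
                + "<p>No link here</p>"
                + "<a href=\"http://example.com/about\">About</a>"
                + "</body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The same callback style extends to other tags or to text nodes (`handleText`). For anything beyond simple pages, a dedicated parser such as JSoup is likely the more comfortable choice.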

subes
http://hc.apache.org/httpcomponents-client-ga/
Maven dependency:

    <dependency>
        <groupId>commons-httpclient</groupId>
        <artifactId>commons-httpclient</artifactId>
        <version>3.1</version>
    </dependency>
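
For the fetching half of a scraper, a rough sketch might look like the following. To stay dependency-free it uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) rather than the Apache client linked above, so treat it as an illustration of the idea and not of that library's API. The class name `PageFetcher`, the user-agent string, and the 10-second timeouts are assumptions chosen for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: downloads a page as text with the JDK's HTTP client.
final class PageFetcher {

    // Made-up user-agent; replace with one that identifies your scraper.
    static final String USER_AGENT = "example-scraper/0.1";

    // Builds a GET request with a timeout and an explicit User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    // Sends the request and returns the body; any status other than 200 is treated as an error.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The returned string can then be handed to whichever HTML parser you settle on. Keeping `buildRequest` separate from `fetch` makes the headers and timeout easy to check without touching the network.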

BZ.