
I'm looking for a good Java API for web scraping. I tried the Web-Harvest API (http://web-harvest.sourceforge.net/usage.php), but I find it a bit clunky. Any other suggestions?

finfinni
    "Any other suggestions?" Just one. Note that when searching for info. on this topic, that word is 'scraping' (one 'p'), not 'scrapping' (which is a separate word that means 'fighting' or 'dumping'). – Andrew Thompson Mar 09 '11 at 18:53
  • 1
    Possible duplicate of [How to "scan" a website (or page) for info, and bring it into my program?](http://stackoverflow.com/questions/2835505/how-to-scan-a-website-or-page-for-info-and-bring-it-into-my-program). See also this [recent question](http://stackoverflow.com/questions/5240981/how-to-easily-parse-html-for-consumption-as-a-service-using-java) for another example. Note that you're basically asking "What is the best HTML parser in Java?". – BalusC Mar 09 '11 at 18:58
  • you can follow [Web scraping with Java][1] [1]: http://stackoverflow.com/questions/3202305/web-scraping-with-java – Sumit Ramteke Sep 15 '14 at 13:22
  • Comparing libraries is generally off-topic here. See the [Software Recommendations Stack Exchange](https://softwarerecs.stackexchange.com/) instead. – Basil Bourque Jun 09 '17 at 20:14

3 Answers


I use this: https://github.com/subes/invesdwin-webproxy

It supports HttpClient and HtmlUnit (a headless browser that supports JavaScript) and, if required, parallelizes requests over a large pool of proxies. I can also recommend JSoup for static HTML processing.
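For the static-HTML case, a minimal JSoup sketch might look like the following (the URL and selector here are placeholders, not anything from the project above):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (placeholder URL)
        Document doc = Jsoup.connect("https://example.com/").get();

        // Print the page title and every link's target and text
        System.out.println(doc.title());
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}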

subes

I've used HttpUnit to do just this task in production.
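For reference, a minimal sketch of the classic HttpUnit API (the URL is a placeholder, not taken from the answer itself):

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;

public class HttpUnitExample {
    public static void main(String[] args) throws Exception {
        // A WebConversation acts like a browser session
        WebConversation wc = new WebConversation();

        // Issue a GET request for the page (placeholder URL)
        WebResponse response = wc.getResponse(
                new GetMethodWebRequest("http://example.com/"));

        // Inspect the returned document
        System.out.println(response.getTitle());
        System.out.println(response.getText());
    }
}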

Speck

http://hc.apache.org/httpcomponents-client-ga/

Maven dependency (note: these coordinates pull in the older Commons HttpClient 3.x; the HttpComponents HttpClient linked above is published under the org.apache.httpcomponents groupId with artifactId httpclient):

<dependency>
  <groupId>commons-httpclient</groupId>
  <artifactId>commons-httpclient</artifactId>
  <version>3.1</version>
</dependency>
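With the Commons HttpClient 3.x dependency shown above, fetching a page might look roughly like this (the URL is just a placeholder):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod("http://example.com/");
        try {
            // Execute the GET request and print the status and body
            int status = client.executeMethod(get);
            System.out.println("Status: " + status);
            System.out.println(get.getResponseBodyAsString());
        } finally {
            // Release the connection when done with the response
            get.releaseConnection();
        }
    }
}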
BZ.