22

I have been researching the headless browsers available to date and found that HtmlUnit is used quite extensively. Is there any alternative to HtmlUnit that offers advantages over it?

Thanks Nayn

Nayn
  • 3,594
  • 8
  • 38
  • 48

6 Answers

8

As far as I know, HtmlUnit is the most powerful headless browser.

What are your issues with it?

Ahmed Ashour
  • 5,179
  • 10
  • 35
  • 56
  • 4
    There are two killer features of HtmlUnit for me: 1. it is OS-independent; 2. it doesn't use a "real" browser as a backend. As a result, there is zero configuration and no surprises on application deployment. And it does its job quite well. – barti_ddu Nov 30 '10 at 00:08
  • Issues with HtmlUnit : http://sourceforge.net/tracker/?group_id=47038&atid=448266 – Nayn Nov 30 '10 at 08:39
  • 3
    The major issue is that it sometimes renders web pages differently from how they would look in a real browser. It also alters the page/tag structure. I also want to execute JavaScript, which has some issues in HtmlUnit. – Nayn Nov 30 '10 at 08:41
  • 2
    HtmlUnit is helpless against, e.g., blog.com. It crashes on any JavaScript error; e.g. wordpress.com can't be loaded because the gravatar JavaScript is blocked in my network. – Danubian Sailor Dec 23 '10 at 11:01
  • 6
    You can use `webClient.setThrowExceptionOnScriptError(false);` to effectively ignore JavaScript errors. – anton1980 Aug 30 '12 at 01:32
  • The real issue with HtmlUnit is how difficult it is to extend some parts of it. If you want to change HOW JavaScript is processed, you are in for a lot of pain. – anton1980 Aug 30 '12 at 01:34
  • AngularJS support in HtmlUnit is very poor, so I am also looking for an alternative for scraping AJAX web sites. I need it to work on GAE Java, but so far I haven't found one. – Splaktar Apr 19 '14 at 15:14
  • Yeah... a headless browser is important for testing in the background, but when JavaScript is not fully supported, it's a nightmare for every developer... :( – gumuruh Sep 28 '16 at 03:52
5

There are many other libraries that you can use for this.

  • If you need to scrape XML-based data, use JTidy.
  • If you need to scrape specific data from HTML, you can use Jsoup.

I use Jsoup myself; it's faster than pretty much any other API.

Kariem
  • 4,398
  • 3
  • 44
  • 73
Sajid Hussain
  • 400
  • 3
  • 10
  • 4
    Jsoup is great, but I guess it cannot crawl a site based on AJAX requests. If it's about clicking on elements, waiting for other HTML code to appear, and evaluating it, IMHO it's not an alternative. – frandevel Apr 27 '13 at 20:23
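
To make the comment above about AJAX-driven sites concrete, here is a minimal sketch of why a parser that never runs scripts misses such content. It uses Python's standard `html.parser` as a stand-in for a static parser; it is not Jsoup's API, and the page markup, `ItemCollector`, and `extract_items` are all made up for the illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: the second "item" paragraph only exists after a
# browser has executed the inline script.
PAGE = """
<html><body>
  <p class="item">served by the server</p>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML =
      '<p class="item">added by JavaScript</p>';
  </script>
</body></html>
"""


class ItemCollector(HTMLParser):
    """Collects the text of every <p class="item"> found in the markup."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "p" and ("class", "item") in attrs:
            self._in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_item = False

    def handle_data(self, data):
        # Script bodies also arrive here as plain text, but they are
        # ignored because no item paragraph is open at that point.
        if self._in_item:
            self.items[-1] += data


def extract_items(html):
    """Return the item texts visible to a parser that runs no scripts."""
    parser = ItemCollector()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


if __name__ == "__main__":
    print(extract_items(PAGE))
```

Only the paragraph present in the served markup is collected; the one the script would insert never shows up, which is the gap a JavaScript-capable headless browser is meant to close.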
4

WebDriver with a virtual framebuffer is the only real alternative. The advantage is that it uses a real browser; the disadvantage is that it's more of a pain to set up, and the API is much poorer.

Tom Anderson
  • 46,189
  • 17
  • 92
  • 133
3

I am going to use Selenium for my use case, since it lets me use a real browser, so there is no deviation from what would be rendered in the real world, unlike with HtmlUnit. I am planning to use Selenium 2, which has WebDriver integration and offers a great API and some nice fixes. Thanks, Nayn

Nayn
  • 3,594
  • 8
  • 38
  • 48
  • 2
    This is what I would recommend too. HtmlUnit's JavaScript engine seems to crash, a lot, on real-world sites. – Joel Dec 27 '10 at 22:31
  • 1
    Selenium is fine...unless you want to work with e.g. SmartGWT JavaScript components...or unless you want to deploy it in a continuous integration environment in a reasonable amount of time...or if you want to run stress tests without a 500-CPU cluster as a test runner etc. – Tomislav Nakic-Alfirevic Sep 27 '12 at 13:11
  • So what are the alternatives to HtmlUnitDriver? Several websites require JavaScript to be fully working... :( – gumuruh Sep 28 '16 at 03:52
2

I use WebKit as a headless browser, through Qt's Python bindings: http://www.riverbankcomputing.co.uk/static/Docs/PyQt4/html/qtwebkit.html

WebKit is the rendering engine used by Safari (and by Chrome before its move to Blink), and is very flexible.

One of my reasons for choosing it over HtmlUnit was ease of setting up:

`sudo apt-get install python-qt4`

hoju
  • 28,392
  • 37
  • 134
  • 178
2

I would also recommend Selenium. Its great feature is that you can create a client that opens a browser page, so you can see what's happening at each step. Creating macros for automated tests is another good feature. However, if you need to scrape some information from a web page, HtmlUnit is better than Selenium.

glassfish
  • 21
  • 2