
Right now I'm working on a web crawler. It should parse some specific sites and write the output to an XML file. Up to this point there's no problem: the crawler works, and you can customize it really quickly via a config file. I use Jsoup to parse the HTML content.
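For reference, the static-HTML part of such a crawler might look roughly like this with Jsoup (the URL and the selector are placeholders, not the actual sites from the question):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlerSketch {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page. This only sees HTML that is
        // present in the server response, not content built later
        // by JavaScript in the browser.
        Document doc = Jsoup.connect("http://example.com").get();

        // Example: extract all absolute link targets (placeholder selector).
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}
```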

I just added a few more sites and noticed that I have a huge problem with HTML content that is created via JavaScript. Isn't there a way to make Jsoup support JavaScript? Or at least to get the full HTML content I can see in my browser?

I already tried HtmlUnit, but it didn't do well; it did not give me the content I see in my browser.

Sincerely,

Ogofo


1 Answer


Jsoup does not support JavaScript, and it does not emulate a browser. Just forget about it if you're planning to execute JavaScript. In my experience HtmlUnit, which is a headless browser, has given me the best results (always talking about Java frameworks).

One thing that is worth trying in HtmlUnit is changing the BrowserVersion (Chrome / Internet Explorer / Firefox) when creating the WebClient instance. Some sites react in a different way, and sometimes just changing that value might give you the results you expect to get.
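A minimal sketch of that idea: create the WebClient with an explicit BrowserVersion, let background JavaScript run, and dump the resulting DOM. The exact constant names (e.g. `BrowserVersion.CHROME` vs. versioned constants like `FIREFOX_10`) vary between HtmlUnit releases, and the URL is a placeholder:

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        // Emulate a specific browser; if a site misbehaves, try a
        // different BrowserVersion constant here.
        try (WebClient client = new WebClient(BrowserVersion.CHROME)) {
            client.getOptions().setJavaScriptEnabled(true);
            // Many real-world pages have scripts that throw; don't abort.
            client.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = client.getPage("http://example.com");
            // Give background JavaScript up to 10 seconds to finish.
            client.waitForBackgroundJavaScript(10_000);

            // The DOM after JavaScript execution, as XML.
            System.out.println(page.asXml());
        }
    }
}
```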

Mosty Mostacho
  • Thanks. I finished this part of the project now. HtmlUnit did not work very well with the sites I gave it. Right now I use PhantomJS, which I execute via Java and let write the output into an .html file. PhantomJS does its job and I don't get any errors; I get nearly the exact HTML that I can inspect in my browser. – Ogofo Sep 28 '12 at 13:16
  • Yes, PhantomJS is really cool. I didn't mention any of those because you were using pure Java. Another option you can take a look at is zombie.js. – Mosty Mostacho Sep 28 '12 at 13:41
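The PhantomJS-from-Java approach described in the comment above can be sketched with `ProcessBuilder`. Here `render.js` is a hypothetical user-supplied PhantomJS script that loads the URL and writes the rendered DOM to the output file; the URL and file names are placeholders:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class PhantomRunner {
    // Build the phantomjs command line. render.js is a hypothetical
    // script that takes a URL and an output file as arguments.
    static List<String> buildCommand(String script, String url, String outFile) {
        return Arrays.asList("phantomjs", script, url, outFile);
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("render.js", "http://example.com", "out.html");
        try {
            Process p = new ProcessBuilder(cmd)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .inheritIO()               // show phantomjs output in our console
                    .start();
            int exit = p.waitFor();
            System.out.println("phantomjs exited with " + exit);
        } catch (IOException e) {
            // phantomjs binary not on the PATH
            System.out.println("could not start phantomjs: " + e.getMessage());
        }
    }
}
```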