
I am looking for the best Java library to which I can pass a URL and have it produce an image of what the web page looks like in a browser. I tried Flying Saucer, but almost every web page breaks it: it won't even render www.google.com or yahoo.com. The only site I could get it to render is www.w3c.org!

Any thoughts on a better tool to use, or on a way to make Flying Saucer more lax about the XHTML it accepts?

empire29

3 Answers


Flying Saucer fails on many pages because it only accepts XHTML (see the manual).

But you can use an HTML library to "clean" your input and then hand the result to Flying Saucer:

Website -> "Cleaner" -> Flying Saucer

Some good and free libraries are listed below; a minimal jsoup + Flying Saucer sketch follows the list.

  1. JSoup (personal recommendation)
  2. HtmlCleaner
  3. JTidy (sometimes more strict than needed)
  4. Jericho HTML
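
For example, here is a minimal sketch of that clean-then-render pipeline, assuming jsoup and Flying Saucer (xhtmlrenderer) are on the classpath. The class names Java2DRenderer and FSImageWriter come from the Flying Saucer distribution I've used and may differ between versions; www.google.com is just the page from the question.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.xhtmlrenderer.swing.Java2DRenderer;
    import org.xhtmlrenderer.util.FSImageWriter;

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    public class CleanAndRender {
        public static void main(String[] args) throws Exception {
            // 1. Fetch the page and let jsoup turn it into well-formed XHTML
            Document doc = Jsoup.connect("http://www.google.com").get();
            doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

            // 2. Write the cleaned markup to a temp file for Flying Saucer
            //    (relative CSS/image URLs may not resolve from a local file)
            File tmp = File.createTempFile("page", ".xhtml");
            Files.write(tmp.toPath(), doc.html().getBytes(StandardCharsets.UTF_8));

            // 3. Render with Flying Saucer and save the result as a PNG
            Java2DRenderer renderer = new Java2DRenderer(tmp, 1024, 768);
            BufferedImage image = renderer.getImage();
            new FSImageWriter().write(image, "page.png");
        }
    }

Even with cleaning, Flying Saucer only supports CSS 2.1 and runs no JavaScript, so heavily scripted pages still won't look like they do in a real browser.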
ollo

Maybe you can try iText (itext.jar).

Download it from http://itextpdf.com/download.php
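
In case it helps, here is a minimal sketch of that route, assuming iText 5 plus its separate XML Worker jar. Note that this produces a PDF rather than an image, and XML Worker also expects fairly well-formed markup; www.example.com is just a placeholder.

    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.PdfWriter;
    import com.itextpdf.tool.xml.XMLWorkerHelper;

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class HtmlToPdf {
        public static void main(String[] args) throws Exception {
            Document pdf = new Document();
            PdfWriter writer = PdfWriter.getInstance(pdf, new FileOutputStream("page.pdf"));
            pdf.open();
            try (InputStream in = new URL("http://www.example.com").openStream()) {
                // XML Worker parses the (X)HTML stream and writes it into the PDF
                XMLWorkerHelper.getInstance().parseXHtml(writer, pdf, in);
            }
            pdf.close();
        }
    }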

Chris Thompson

About HTML crawling:

Use java.net.URL from the standard library; there are plenty of examples of this.
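
For instance, a minimal sketch using only the standard library (www.example.com is just a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class FetchHtml {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://www.example.com");
            StringBuilder html = new StringBuilder();
            // Read the raw HTML line by line from the URL's input stream
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            System.out.println(html);
        }
    }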

About PDF converting:

If you are using the Spring Framework, you can use the AbstractPdfView class, which works via the iText API. This is my favorite approach, and I think you can easily make use of it.
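
A minimal sketch of such a view, assuming Spring MVC's AbstractPdfView (which is built on the older com.lowagie iText API). The "pageHtml" model key and the PageToPdfView class name are made up for this example:

    import com.lowagie.text.Document;
    import com.lowagie.text.Paragraph;
    import com.lowagie.text.pdf.PdfWriter;
    import org.springframework.web.servlet.view.document.AbstractPdfView;

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import java.util.Map;

    public class PageToPdfView extends AbstractPdfView {

        @Override
        protected void buildPdfDocument(Map<String, Object> model, Document document,
                                        PdfWriter writer, HttpServletRequest request,
                                        HttpServletResponse response) throws Exception {
            // The controller is assumed to put the fetched page source into the model
            String html = (String) model.get("pageHtml");
            // Build the PDF content with the iText API; this just dumps the raw markup
            document.add(new Paragraph(html));
        }
    }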

About image converting:

I recommend this one: http://code.google.com/p/java-html2image/
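
The usage is roughly this (written from memory of the project page, so treat the class and method names as unverified; www.example.com is a placeholder):

    import gui.ava.html.image.generator.HtmlImageGenerator;

    public class UrlToImage {
        public static void main(String[] args) {
            HtmlImageGenerator generator = new HtmlImageGenerator();
            generator.loadUrl("http://www.example.com");  // fetch and lay out the page
            generator.saveAsImage("page.png");            // write the rendered page as a PNG
        }
    }

As far as I know it relies on Swing's built-in HTML renderer, so CSS support is limited and complex pages may not look like they do in a browser.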

In total:

Read the HTML via URL, then convert it with iText or java-html2image. I strongly recommend doing this part yourself rather than leaving everything to a single library.

Lee Dongjin