7

I have a database full of small HTML documents and I need to programmatically insert several into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason, honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along these lines:

Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );

Therefore (I think) I need some kind of HTML parser which I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.

Cœur
banjollity

5 Answers

2

HTMLparser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out.

You can parse out CSS using the CssSelectorNodeFilter.
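To illustrate the general idea of walking the HTML and collecting styled text runs, here is a minimal sketch. It deliberately does not use HTMLParser; it swaps in the callback parser that ships with the JDK (javax.swing.text.html) so that it stays self-contained. The class name HtmlRunCollector and the "BOLD: " / "PLAIN: " prefixes are made up for this example, and only <b> and <strong> are tracked.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects text runs and records whether each one was inside <b>/<strong>.
class HtmlRunCollector extends HTMLEditorKit.ParserCallback {
    final List<String> runs = new ArrayList<>();
    private int boldDepth = 0; // a counter, so nested bold tags are handled

    private static boolean isBoldTag(HTML.Tag t) {
        return t == HTML.Tag.B || t == HTML.Tag.STRONG;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (isBoldTag(t)) boldDepth++;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (isBoldTag(t) && boldDepth > 0) boldDepth--;
    }

    @Override
    public void handleText(char[] data, int pos) {
        String text = new String(data);
        // Skip whitespace-only runs; everything else is tagged with the current style.
        if (!text.trim().isEmpty()) {
            runs.add((boldDepth > 0 ? "BOLD: " : "PLAIN: ") + text);
        }
    }

    // Parses an HTML fragment and returns the styled runs in document order.
    static List<String> collect(String html) {
        HtmlRunCollector collector = new HtmlRunCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.runs;
    }

    public static void main(String[] args) {
        for (String run : collect("<p>some string <b>A bold string</b></p>")) {
            System.out.println(run);
        }
    }
}
```

Each collected run would then map onto one setBold(...) / insert(...) pair in the document API outlined in the question. The same shape should carry over to HTMLParser's filters and visitors, but check that library's own documentation for the exact calls.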

Craig Angus
  • This suggestion allowed me to build a rudimentary version of what I want in about an hour and around 100 lines of code. A winner is you! – banjollity Oct 23 '08 at 07:20
1

If the HTML is well-formed XML (XHTML), why not use an XML parser (such as Xerces) and then inspect the DOM tree programmatically?
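A minimal sketch of that approach, assuming the stored fragments really are well-formed and have a single root element. It uses the JAXP DocumentBuilder bundled with the JDK (which is Xerces-derived) instead of a separate Xerces download. XhtmlRunExtractor and Run are hypothetical names for this example, and only <b> and <strong> are treated as bold.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns a well-formed XHTML fragment into a flat list of styled text runs.
class XhtmlRunExtractor {

    // One piece of text plus the bold state that was active around it.
    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    static List<Run> extract(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            org.w3c.dom.Document dom = builder.parse(new InputSource(new StringReader(xhtml)));
            List<Run> runs = new ArrayList<>();
            walk(dom.getDocumentElement(), false, runs);
            return runs;
        } catch (Exception e) {
            // Malformed input ends up here; XHTML that is not well-formed cannot be parsed this way.
            throw new IllegalArgumentException("Could not parse XHTML fragment", e);
        }
    }

    // Depth-first walk that passes the inherited bold flag down to child nodes.
    private static void walk(Node node, boolean bold, List<Run> runs) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.trim().isEmpty()) {
                runs.add(new Run(text, bold));
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments, processing instructions, etc. are ignored
        }
        String name = node.getNodeName().toLowerCase();
        boolean childBold = bold || name.equals("b") || name.equals("strong");
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), childBold, runs);
        }
    }

    public static void main(String[] args) {
        for (Run run : extract("<p>some string <b>A bold string</b></p>")) {
            System.out.println((run.bold ? "bold:  " : "plain: ") + run.text);
        }
    }
}
```

Each Run can then be fed to the target library, setting bold before inserting the text. Note that HTML-only entities such as &nbsp; are not defined in plain XML, so fragments containing them will be rejected unless a DTD that declares them is supplied.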

Vinze
0

Adobe Acrobat Pro allows you to grab sites via HTTP and does an excellent job of preserving the style and layout. I haven't used it from an API aspect, but it may be worth looking into.

Diodeus - James MacFarlane
0

You'd probably be better off getting a component that goes directly from HTML to PDF or Word than trying to parse the HTML document and duplicate the formatting yourself. If you want to convert HTML to PDF and you use .NET, Winnovative provides a good solution.

Kibbee
0

Check out the Flying Saucer XHTML renderer. It renders well-formed XHTML files to PDF and lets you control the output using CSS.

Tim Howland