3

I started using docx4j today.

I've successfully created a document with a table, fed with content coming from an external source.

This content has simple HTML inside, for example a column may contain a String like:

String content = "Hello&nbsp;<strong>Word</strong><br>";

If I put this String in the column with the createParagraphOfText() method:

Tc tableCell = factory.createTc();    
tableCell.getContent().add(
    wordMLPackage.getMainDocumentPart().createParagraphOfText(content)
);
tableRow.getContent().add(tableCell);

it is rendered as-is in the Word document (as expected):

Hello&nbsp;<strong>Word</strong><br>

What I'm trying to achieve is to place in the document the rendered HTML, to get the following output:

Hello Word


I've searched on StackOverflow and the Web, and tried almost all of the examples I found, but the information is quite fragmented, and before digging deeper I would like to know at least whether I'm heading in the right direction.

I've added the docx4j-ImportXHTML jar to Maven, but in the docs it states that the content must be a well-formed XHTML, while I have only a bunch of text and HTML mixed together.

Also many of the (few) examples using it consist of taking an existing XML file to convert it to docx, while I'm good with fully creating the docx manually, and only need to render a single String containing HTML. Is it possible with this module ?

I've also seen that there are other docx4j modules (eg. xhtmlrenderer), but I'm not sure about which is the good one.

Does anyone know the right way to add chunks of HTML to a table cell while iterating?

Andrea Ligios
  • Did you try using an `AlternativeFormatInputPart`? – Ascalonian Jan 27 '15 at 18:00
  • Or this? http://stackoverflow.com/questions/24628367/docx4j-replace-variable-with-html – Ascalonian Jan 27 '15 at 18:03
  • Thank you, I'll try it now. The 2nd is what I was trying; I stopped and asked here before proceeding because I cannot guarantee that my source is XHTML, or even well-formed, apart from opening/closing tags, which I can assume are correctly nested. So if I get a `<br>`, which is perfectly valid HTML5, will it break because of the missing self-closing slash on the void element?
    – Andrea Ligios Jan 28 '15 at 09:53
  • @Ascalonian I've tried without altering the content, and it fails as expected because it's not valid XHTML, raising `org.xml.sax.SAXParseException: Content is not allowed in prolog.` I'll try now the `AlternativeFormatInputPart` way and let you know – Andrea Ligios Jan 28 '15 at 10:14
  • Look forward to knowing :-) – Ascalonian Jan 28 '15 at 11:24

3 Answers

3

You have a choice to make:

  • convert your (X)HTML to docx content yourself, or
  • let Word do it

Doing it yourself gives you greater control, and means downstream processing will work (eg convert to PDF) without having to open the docx in Word first.

Letting Word do it is the AlternativeFormatInputPart (altChunk) approach.

My advice would be to do it yourself if you can. And I'd suggest you use docx4j-ImportXHTML for that.

I've added the docx4j-ImportXHTML jar to Maven, but in the docs it states that the content must be a well-formed XHTML, while I have only a bunch of text and HTML mixed together.

You can use one of the "tidy" libraries to convert to XHTML. Since there are quite a few of these, we leave which you use and how you configure it up to you.

only need to render a single String containing HTML. Is it possible with this module ?

ConvertInXHTMLFragment.java is an example.

I've also seen that there are other docx4j modules (eg. xhtmlrenderer), but I'm not sure about which is the good one.

docx4j-ImportXHTML is dependent on that.

JasonPlutext
  • Great answer! Could you please explain if (and possibly how) it's possible to use the docx4j-importXHTML approach during the construction of tables ? Thank you very much – Andrea Ligios Jan 29 '15 at 00:21
  • 1
    `XHTMLImporter.convert` returns a `List<Object>`, so you can use `addAll` to add those objects to your table cell's content list. – JasonPlutext Jan 29 '15 at 02:42
  • It's working! TBH, I had instinctively tried `.addAll()` on the table cell's content with the result of the `ImportXHTML.convert()` method in an earlier attempt, but I forgot to enclose my content in a `<div>` to make it valid, and got cryptic errors... so naive :) Today it complained about an unresolved `&agrave;`, so I enclosed the content of each cell in an entire XHTML page with the XHTML 1.0 Transitional DTD. Now it generates the document, but it strips the Latin accents (à is à). BTW I'm really near the end of this. I'll post an answer too once it's 100% working. Again, thank you very much!
    – Andrea Ligios Jan 29 '15 at 11:37
1

If you have simple HTML instead of XHTML, like

String content = "Hello&nbsp;<strong>Word</strong><br>";

the solution is to encapsulate your HTML into an HTML element, eg. a div:

content = "<div>" + content + "</div>";

and manually replace unclosed void elements, eg.:

content = content.replaceAll("<br>", "<br/>");
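A single `replaceAll` covers only `<br>`. As a rough sketch, not part of docx4j, a regex can self-close the other HTML void elements as well. The names `VoidElementFixer` and `closeVoidElements` are hypothetical, and the pattern assumes attribute values contain no `<` or `>` characters; for arbitrary HTML a real tidy library is the safer choice.

```java
import java.util.regex.Pattern;

public class VoidElementFixer {
    // HTML void elements, which XHTML requires to be written as <tag/>.
    // Group 1 is the tag name, group 2 any attributes.
    private static final Pattern VOID_TAG = Pattern.compile(
        "<(area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)\\b([^<>]*?)\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Rewrites <br>, <br/> and <br /> alike to <br/>, keeping attributes.
    static String closeVoidElements(String html) {
        return VOID_TAG.matcher(html).replaceAll("<$1$2/>");
    }

    public static void main(String[] args) {
        // prints: Hello <strong>Word</strong><br/>
        System.out.println(closeVoidElements("Hello <strong>Word</strong><br>"));
    }
}
```

Tags that are already self-closed come out unchanged in meaning, so the method can be applied more than once without harm.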

At this point, you might get errors for unrecognized HTML entities, e.g. Latin accents (&agrave; and so on). You can then wrap your content in a full HTML document with a DTD declaration, instead of a div. End of story.
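A different technique from the DTD wrapper, shown here only as a hedged sketch: replace named entities with numeric character references, which any XML parser accepts without a DTD. The class `EntityFixer`, its methods, and the small entity table are hypothetical and would need extending for real content; whether the rendered result matches the DTD approach has not been verified here.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EntityFixer {
    // Illustrative subset of HTML named entities and their code points.
    private static final Map<String, Integer> ENTITIES = Map.of(
        "nbsp", 160, "agrave", 224, "egrave", 232, "eacute", 233,
        "igrave", 236, "ograve", 242, "ugrave", 249);

    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known named entities with numeric references (e.g. &#224;);
    // unknown names, including XML's own &amp; and &lt;, are left untouched.
    static String toNumericEntities(String xhtml) {
        Matcher m = NAMED.matcher(xhtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = ENTITIES.get(m.group(1));
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // True if the string parses as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String cell = "<div>"
            + toNumericEntities("Citt&agrave;&nbsp;<strong>Word</strong><br/>")
            + "</div>";
        System.out.println(cell + " well-formed: " + isWellFormed(cell));
    }
}
```

Checking well-formedness with the JDK parser before handing each cell to the importer gives a clearer failure point than an exception from deep inside the conversion.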

Working example:

private void whatever() throws Exception {

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // pkg, factory and importer are fields of the enclosing class
    pkg = WordprocessingMLPackage.createPackage(PageSizePaper.A4, true);
    factory = Context.getWmlObjectFactory();
    // The original snippet did not show how importer was created;
    // an XHTMLImporterImpl bound to pkg is assumed here
    importer = new XHTMLImporterImpl(pkg);

    Tbl table = factory.createTbl();
    for (Item item : Items) {
        Tr tableRow = factory.createTr();
        Tc tableCell = factory.createTc();

        /* This is the core of the problem */
        // Wrap the raw HTML exactly once, then convert it to docx4j objects
        String content = wrapXHTML(item.getContent());
        List<Object> objects = importer.convert(content, null);
        tableCell.getContent().addAll(objects);
        /* problem solved */

        tableRow.getContent().add(tableCell);
        table.getContent().add(tableRow);
    }
    pkg.getMainDocumentPart().addObject(table);
    pkg.save(baos);
}

private String wrapXHTML(String content) {
    content = content.replaceAll("<br>", "<br/>");
    /* ... other substitutions ... */

    return dtd + html + head + start + content + end;
}

private final static String dtd = 
                     "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\""
                     + " \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";
private final static String html = "<html xmlns=\"http://www.w3.org/1999/xhtml\">";
private final static String head = "<head></head>";
private final static String start = "<body><div>";
private final static String end = "</div></body></html>";
Andrea Ligios
-1

The HTML must be well-formed, as below. In my case &nbsp; did not work, so I removed it.

  String content = "<html>Hello <strong>Word</strong><br/></html>";

XHTMLImporter is used to convert the XHTML into docx4j content:

XHTMLImporter xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
Tc tableCell = factory.createTc();

This is the change you need in your code

// Add the converted objects to the cell itself, not to the main document part
tableCell.getContent().addAll(xHTMLImporter.convert(content, null));
tableRow.getContent().add(tableCell);

This code works for me, please try this.

MLavoie