
I'm using openhtmltopdf to convert HTML to PDF. Currently I get an exception if the HTML contains entities for German characters such as ä, ö, ü.

  PdfRendererBuilder builder = new PdfRendererBuilder();
  builder.useFastMode();
  builder.withHtmlContent(html,"file://localhost/");
  builder.toStream(out);
  builder.run();

org.xml.sax.SAXParseException; lineNumber: 17; columnNumber: 31; The entity "auml" was referenced, but not declared.

Here is my HTML:

<html>
   <head>      
      <meta charset="UTF-8" />
    </head>
    <body>
        k&auml;se
    </body>
</html>

The word that should appear in the PDF is "käse" (cheese).


UPDATE

I have tried using an entity resolver, like this:

  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  DocumentBuilder builder = null;
  try {
    builder = factory.newDocumentBuilder();

    ByteArrayInputStream input = new ByteArrayInputStream(html.getBytes("UTF-8"));
    builder.setEntityResolver(FSEntityResolver.instance());
    org.w3c.dom.Document doc = builder.parse(input);
  } catch (Exception e) {
    logger.error(e.getMessage(), e);
  }

but I'm still getting the same exception at "parse".

Neo

1 Answer


It looks like you need to either provide a DTD that declares the entity, or replace the entity name auml with its numeric character reference, i.e. &#xE4; (hex) or &#228; (decimal). See A.2. Entity Sets and HTML 4 Entity Names.

With an inline DTD, the HTML content would look like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
        <!ENTITY auml "&#228;">
]>
<html>
    <head>
    </head>
    <body>
        k&auml;se
    </body>
</html>
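
To check that an inline DTD like the one above is enough for a standard XML parser, here is a minimal sketch that prepends the prolog to an HTML string and parses it with the JDK's own `DocumentBuilder`. The class name `InlineDtdDemo` and the helper `bodyText` are made up for illustration, and the sketch assumes the incoming HTML is well-formed XML and does not already carry its own XML declaration or DOCTYPE.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of openhtmltopdf.
class InlineDtdDemo {

    // Internal DTD subset declaring only the entity used in the question.
    // Add further <!ENTITY ...> lines for other names your HTML uses.
    static final String DTD_PROLOG =
            "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
            + "<!DOCTYPE html [\n"
            + "  <!ENTITY auml \"&#228;\">\n"
            + "]>\n";

    // Prepends the prolog, parses the result and returns the trimmed body text.
    // Assumes the input has no XML declaration or DOCTYPE of its own.
    static String bodyText(String htmlWithoutProlog) {
        try {
            String xml = DTD_PROLOG + htmlWithoutProlog;
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("body").item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse HTML as XML", e);
        }
    }
}
```

Calling `InlineDtdDemo.bodyText(html)` on the HTML from the question should no longer raise the "entity was referenced, but not declared" error, because the parser now finds a declaration for auml.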

Alternatively, you can run through the HTML string and replace the entity names with their corresponding decimal or hex values, or simply prepend the DTD to your HTML string before passing it to the PDF builder.
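
The replacement approach could be sketched roughly as follows (Java 9+). The class `EntityNameReplacer`, its method `toNumeric` and the lookup table are hypothetical; the table holds only a handful of German-related HTML 4 names and would need extending for real documents. Names that are not in the table, including the XML built-ins such as &amp;amp;, are left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of openhtmltopdf.
class EntityNameReplacer {

    // Illustrative subset of HTML 4 entity names mapped to Unicode code points.
    private static final Map<String, Integer> CODE_POINTS = Map.of(
            "auml", 0xE4, "ouml", 0xF6, "uuml", 0xFC,
            "Auml", 0xC4, "Ouml", 0xD6, "Uuml", 0xDC,
            "szlig", 0xDF, "nbsp", 0xA0);

    // Matches named references such as &auml; but not numeric ones such as &#228;.
    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces every known entity name with its decimal character reference.
    static String toNumeric(String html) {
        Matcher m = NAMED_ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Unknown names are copied through unchanged.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The result of `EntityNameReplacer.toNumeric(html)` can then be handed to `withHtmlContent` in place of the original string.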


Update

You might want to give the jsoup library a try. It parses the HTML and provides you with an org.w3c.dom.Document, e.g.

Document jsoupDoc = Jsoup.parse(html); // org.jsoup.nodes.Document
W3CDom w3cDom = new W3CDom(); // org.jsoup.helper.W3CDom
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);

You can then pass the w3cDoc to the PDF builder like so:

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withW3cDocument(w3cDoc, "file://localhost/");
builder.toStream(out); // same output stream as in the question
builder.run();
Kenan Güler
  • Your answer goes in the right direction, thx. I'm pretty sure I can do it programmatically, instead of declaring the DTD in the html. I have tried using an entity resolver (I have updated my question), still not working, but I think I'm closer... – Neo Mar 17 '20 at 09:30
  • @Zardo the `javax.xml.parsers.DocumentBuilder` you are using requires a well defined document, which is not the case with the html file you provided. I updated my answer. `jsoup` would help you with the html parsing part, so that you don't have to touch your existing html files. – Kenan Güler Mar 17 '20 at 22:27