HTML to PDF with German alphabet

Question

I'm using openhtmltopdf to transform HTML to PDF. Currently I get an exception if the HTML contains German characters such as ä, ö, ü.

  PdfRendererBuilder builder = new PdfRendererBuilder();
  builder.useFastMode();
  builder.withHtmlContent(html,"file://localhost/");
  builder.toStream(out);
  builder.run();

org.xml.sax.SAXParseException; lineNumber: 17; columnNumber: 31; The entity "auml" was referenced, but not declared.

Here my html:

<html>
   <head>      
      <meta charset="UTF-8" />
    </head>
    <body>
        k&auml;se
    </body>
</html>

The word I want exported is "käse" (cheese).

UPDATE

I have tried with an entity resolver, in this way:

  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  DocumentBuilder builder = null;
  try {
    builder = factory.newDocumentBuilder();

    ByteArrayInputStream input = new ByteArrayInputStream(html.getBytes("UTF-8"));
    builder.setEntityResolver(FSEntityResolver.instance());
    org.w3c.dom.Document doc = builder.parse(input);
  } catch (Exception e) {
    logger.error(e.getMessage(), e);
  }

but I'm still getting the same exception at "parse".

Do you have `` in your HTML-Document where you want to create the PDF? — Norbert Bartko, Mar 09 '20 at 14:47

Kenan Güler · Answer 1 · 2020-03-17T22:21:35.017

It looks like you either need to provide a DTD or replace the entity name auml with its corresponding hex or decimal character reference, i.e. &#xE4; or &#228; respectively. See A.2. Entity Sets and HTML 4 Entity Names.

The html content would look like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
        <!ENTITY auml "&#228;">
]>
<html>
    <head>
    </head>
    <body>
        k&auml;se
    </body>
</html>

Alternatively, you can run through the html string and replace the entity names with their corresponding dec/hex values, which should be fine, or just prepend the DTD to your html string before passing it to the pdf builder.
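
Both alternatives can be sketched with nothing but the JDK. This is only an illustration under a few assumptions: the class name EntitySketch, its method names, and the small entity table are made up for this example (a real table would cover all the HTML 4 entity names you use), and the input is assumed to be otherwise well-formed XML with no DOCTYPE or XML declaration of its own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EntitySketch {

    // Hypothetical, deliberately incomplete table: entity name -> decimal code point.
    private static final Map<String, Integer> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("auml", 228);  // a-umlaut
        ENTITIES.put("ouml", 246);  // o-umlaut
        ENTITIES.put("uuml", 252);  // u-umlaut
        ENTITIES.put("Auml", 196);  // capital A-umlaut
        ENTITIES.put("Ouml", 214);  // capital O-umlaut
        ENTITIES.put("Uuml", 220);  // capital U-umlaut
        ENTITIES.put("szlig", 223); // sharp s
    }

    /** Option 1: prepend a DOCTYPE whose internal subset declares the entities. */
    public static String withEntityDtd(String html) {
        StringBuilder doctype = new StringBuilder("<!DOCTYPE html [\n");
        for (Map.Entry<String, Integer> e : ENTITIES.entrySet()) {
            doctype.append("  <!ENTITY ").append(e.getKey())
                   .append(" \"&#").append(e.getValue()).append(";\">\n");
        }
        return doctype.append("]>\n").append(html).toString();
    }

    /** Option 2: replace each known entity name with its decimal character reference. */
    public static String toNumericReferences(String html) {
        String result = html;
        for (Map.Entry<String, Integer> e : ENTITIES.entrySet()) {
            result = result.replace("&" + e.getKey() + ";", "&#" + e.getValue() + ";");
        }
        return result;
    }

    /** Parses the markup as XML and returns the trimmed text of the first body element. */
    public static String bodyText(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("body").item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head></head><body>k&auml;se</body></html>";
        // Both variants should get past the XML parser without the
        // "entity was referenced, but not declared" error.
        System.out.println(bodyText(withEntityDtd(html)));
        System.out.println(bodyText(toNumericReferences(html)));
    }
}
```

Either returned string could then be handed to withHtmlContent as in the question. The plain string replacement in option 2 is naive: it would also rewrite entity-like text inside CDATA sections or comments, so treat it as a starting point rather than a finished solution.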

Update

You might want to give the jsoup library a try. It parses the HTML, and the result can be converted to an org.w3c.dom.Document, e.g.

Document jsoupDoc = Jsoup.parse(html); // org.jsoup.nodes.Document
W3CDom w3cDom = new W3CDom(); // org.jsoup.helper.W3CDom
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);

You can then pass the w3cDoc to the pdf builder like so

PdfRendererBuilder builder = new PdfRendererBuilder();
builder.withW3cDocument(w3cDoc, "file://localhost/");
builder.toStream(out); // same output stream as in the question
builder.run();

Your answer goes in the right direction, thx. I'm pretty sure I can do it programmatically, instead of declaring the DTD in the html. I have tried using an entity resolver (I have updated my question), still not working, but I think I'm closer... — Neo, Mar 17 '20 at 09:30
@Zardo the `javax.xml.parsers.DocumentBuilder` you are using requires a well defined document, which is not the case with the html file you provided. I updated my answer. `jsoup` would help you with the html parsing part, so that you don't have to touch your existing html files. — Kenan Güler, Mar 17 '20 at 22:27
