HTML parsing exceptions in iText

Question

I have a p:editor in PrimeFaces where users paste Word documents containing email templates and save them to the database.

Now I need to convert this content into a PDF, but what I get back from the database is an HTML conversion of that Word document.

While parsing this HTML content with iText, I run into a lot of errors because of invalid XHTML like the snippet below:

<span style="font-family: Arial, Verdana; font-size: 13.3333px;"><img src="9#credit_cards_logos#9"></span>

With the above snippet, I get the error "invalid span tag. Expected closing img tag." When I remove the span tag around the img, it works fine.

Errors like this are all over the place, and it's not possible to fix them all manually: each template is huge, and there are hundreds of templates.

Here is the function I am using to parse it:

public StreamedContent getFile() throws IOException, DocumentException {
        // Unwrap the servlet response from the portlet response (Liferay)
        final PortletResponse portletResponse = (PortletResponse) FacesContext.getCurrentInstance().getExternalContext()
                .getResponse();
        final HttpServletResponse res = PortalUtil.getHttpServletResponse(portletResponse);
        res.setContentType("application/pdf");
        res.setHeader("Cache-Control", "no-store, no-cache, must-revalidate");
        res.setHeader("Content-Disposition", "attachment; filename=" + subject + ".pdf");
        res.setHeader("Refresh", "1");
        res.flushBuffer();
        // Build the PDF in memory first, then copy it to the response
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        OutputStream out = res.getOutputStream();
        Document document = new Document(PageSize.LETTER);
        PdfWriter pdfWriter = PdfWriter.getInstance(document, baos);
        document.open();
        document.addCreationDate();
        XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
        //htmlWorker.parse(new StringReader(getMessage()));
        // getMessage() returns the HTML stored in the DB; this is where the parse errors are thrown
        worker.parseXHtml(pdfWriter, document, new StringReader(getMessage()));
        document.close();
        baos.writeTo(out);
        out.flush();
        out.close();
        return null;
    }

Is there a workaround for this?

EDIT

Is there something like p:dataExporter (which works only for datatables) in PrimeFaces that will convert the content into a PDF without the need to parse the HTML?

You are using ye olde iText 5 with XMLWorker. XMLWorker expects 100% valid HTML, so you need to clean up your HTML before you give it to XMLWorker. You can use something like JSoup or JTidy. The alternative is iText 7 + pdfHTML, which can handle invalid HTML a lot better. — Amedee Van Gasse, Oct 15 '18 at 08:24
You asked the same question twice: https://stackoverflow.com/questions/52809656/invalid-span-tag-expected-closing-br-tag — Amedee Van Gasse, Oct 15 '18 at 08:26
In my comment on your question https://stackoverflow.com/questions/52773998/extracting-table-from-html-string-and-generating-pdf-using I already recommended switching to iText 7 + pdfHTML. — Amedee Van Gasse, Oct 15 '18 at 09:21
@Amedee, do I need to purchase a license to use iText 7 and pdfHTML? — Naxi, Oct 15 '18 at 09:59
It depends. Is your own software open source under the AGPL license? Then you are already licensed to use iText. Is your software distributed under a license that is not compatible with AGPL (for example a commercial license)? Then you need to purchase a commercial license of iText. — Amedee Van Gasse, Oct 15 '18 at 10:16
Ahh... iText is not an option for me then. Do you recommend anything else that can deal with such weird HTML and is open source? — Naxi, Oct 15 '18 at 10:55
That would go against my employer's interests, so no, I cannot recommend an alternative even if I knew any. — Amedee Van Gasse, Oct 15 '18 at 10:57
Also, you are currently using XMLWorker, which is part of iText 5, which is also AGPL. So you currently already require a valid license. — Amedee Van Gasse, Oct 15 '18 at 10:59
For a good implementation with iText 7, refer to https://stackoverflow.com/a/57251780/14784590 — Reejesh, Jul 04 '23 at 07:05
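
The cleanup step suggested in the comments above (making the HTML well-formed before handing it to XMLWorker) can be illustrated with a very narrow sketch. This is not JSoup or JTidy; it is a plain-regex stand-in that only self-closes a few void elements such as img and br, which is the specific problem in the snippet from the question. The class name VoidTagCloser and the method selfCloseVoidTags are hypothetical names introduced for illustration, and the regex assumes that attribute values do not contain a ">" character. Word-generated HTML usually has other problems too (unclosed p tags, unquoted attributes, entities), so a real HTML cleaner is likely still needed.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: rewrites void elements as self-closed tags.
final class VoidTagCloser {

    // Matches a small set of void elements, with or without a trailing slash.
    // Group 1 is the tag name, group 2 the raw attribute text (may be empty).
    // Assumption: no '>' appears inside an attribute value.
    private static final Pattern VOID_TAG = Pattern.compile(
            "<(img|br|hr|input|meta|link)\\b([^>]*?)\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    private VoidTagCloser() {
    }

    // Returns the input with every matched void element written as <tag attrs />.
    static String selfCloseVoidTags(String html) {
        return VOID_TAG.matcher(html).replaceAll("<$1$2 />");
    }
}
```

Under these assumptions, the result of VoidTagCloser.selfCloseVoidTags(getMessage()) would be wrapped in the StringReader passed to parseXHtml instead of the raw getMessage() value. Tags that are already self-closed are left effectively unchanged, so running the method twice gives the same output as running it once.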

score 1 · Answer 1 · answered Oct 15 '18 at 08:32

The answer to

Is there something like p:dataExporter(only for datatables) in primefaces which will convert the contents into pdf without the need to parse the HTML.

is: No, there is not.

Kukeltje

score 0 · Accepted Answer · answered Oct 16 '18 at 07:05

This worked for me: html2pdfrocket

https://www.html2pdfrocket.com/convert-java-html-to-pdf

They have a free tier available.

Naxi
