iTextSharp XMLParser.Parse throws and swallows exceptions repeatedly

Question

I have a function which I am using to try to give iTextSharp some HTML and from it generate a PDF. This function successfully generates the PDF, complete with CSS styling, but it is not running fast enough for our requirements.

The one area that takes particularly long to execute is the call to XMLParser.Parse, which I have seen take upwards of 14 seconds for an 8-page document showing little more than a table of data headed by icons. During the execution of this method I have noticed (in the output window) three exceptions being thrown (and presumably caught) by iTextSharp or by code that iTextSharp calls into. These exceptions are:

'System.Collections.Generic.KeyNotFoundException' in mscorlib.dll
'iTextSharp.tool.xml.exceptions.NoDataException' in itextsharp.xmlworker.dll
'System.ArgumentException' in mscorlib.dll

The three exceptions repeat (albeit not necessarily in that order) until the Parse method has finished executing.

Whilst I realize I don't need to handle these exceptions myself, I mention them because I am trying to improve the performance of this method and understand that throwing and catching exceptions can be expensive. What I would like to know is what causes these exceptions to be thrown and, if they depend on bad data I am passing in, which data is bad.
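To narrow down how often each swallowed exception occurs, one option is to count first-chance exceptions while the slow call runs. The following is only a minimal sketch: the class name FirstChanceExceptionCounter and its Count method are hypothetical helpers I made up for illustration, and the only framework API relied on is the AppDomain.CurrentDomain.FirstChanceException event (available since .NET Framework 4.0).

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.ExceptionServices;

// Hypothetical helper, not part of iTextSharp.
public static class FirstChanceExceptionCounter
{
    // Runs the given action and returns how many exceptions of each type were
    // thrown (whether or not they were later caught) in the current AppDomain
    // while the action was running.
    public static ConcurrentDictionary<string, int> Count(Action action)
    {
        var counts = new ConcurrentDictionary<string, int>();

        // The handler is raised when an exception is thrown, before any catch
        // block handles it. It must not throw itself.
        EventHandler<FirstChanceExceptionEventArgs> handler = (sender, e) =>
            counts.AddOrUpdate(e.Exception.GetType().FullName, 1, (key, n) => n + 1);

        AppDomain.CurrentDomain.FirstChanceException += handler;
        try
        {
            action();
        }
        finally
        {
            // Always unsubscribe so later code is not affected.
            AppDomain.CurrentDomain.FirstChanceException -= handler;
        }

        return counts;
    }
}
```

In the function below, the Parse call could be wrapped as FirstChanceExceptionCounter.Count(() => parser.Parse(reader)), where parser stands for the XMLParser instance. If the counts run into the thousands, the exceptions are a plausible contributor to the slowness; capturing Environment.StackTrace inside the handler should, I assume, also show which iTextSharp code path throws them.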

This is the HTML to PDF function as it currently stands. Note that I have already tried using XMLWorkerFontProvider.DONTLOOKFORFONTS, as suggested by the itext_so manual, but have not made any performance gains from doing so. I have also noted a NullReferenceException thrown and swallowed once by PdfWriter.GetInstance; I wondered whether this may also be related to the exceptions thrown in the Parse method.

public static byte[] GeneratePdfFromHtml(string html, Action<PdfWriter, Document> pdfSettings = null, string additionalFooterText = null)
{
    var tagProcessor = (DefaultTagProcessorFactory)Tags.GetHtmlTagProcessorFactory();

    tagProcessor.RemoveProcessor(HTML.Tag.IMG);
    tagProcessor.AddProcessor(HTML.Tag.IMG, new CustomProcessorImageTag());

    using (var workStream = new MemoryStream())
    using (var document = new Document())
    //NOTE: The NullReferenceException is thrown via this call.
    using (PdfWriter writer = PdfWriter.GetInstance(document, workStream))
    {
        PdfEventHelper pdfEventHelper = new PdfEventHelper(additionalFooterText);

        writer.PageEvent = pdfEventHelper;
        writer.CloseStream = false;
        pdfSettings?.Invoke(writer, document);
        document.Open();

        var xmlWorkerFontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);

        //Not noticeably faster without this font directory registration.
        xmlWorkerFontProvider.RegisterDirectory("~/Content/fonts", false);

        //TODO: If further performance is needed then this line is the next slowest (.43ms).
        var htmlContext = new HtmlPipelineContext(new CssAppliersImpl(xmlWorkerFontProvider));

        htmlContext.SetTagFactory(tagProcessor);

        Func<string, string> mapPath = HttpContext.Current.Server.MapPath;
        var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);

        foreach (var cssFileName
            in new[]
            {
                "bootstrap.min.css",
                "Pdf.css",
                "Rdp.css"
            })
            cssResolver.AddCssFile(mapPath($"~/Content/{cssFileName}"), true);

        using (var reader = new StringReader(html))
        {
            new XMLParser(
                new XMLWorker(
                    new CssResolverPipeline(
                        cssResolver,
                        new HtmlPipeline(
                            htmlContext,
                            new PdfWriterPipeline(
                                document,
                                writer))),
                    true))
            //TODO: Speed up this line - this is the slowest line in the method by far.
            //NOTE: This throws a series of ArgumentExceptions, NoDataExceptions and KeyNotFoundExceptions.
            .Parse(reader);

            document.Close();

            return workStream.ToArray();
        }
    }
}

Is there a specific reason why you chose an old version of iText that is no longer supported instead of the newer iText 7 and pdfHtml add-on? See https://stackoverflow.com/questions/47895935/converting-html-to-pdf-using-itext for more info. — Bruno Lowagie, Apr 30 '18 at 17:07
I was unaware of iText 7; when looking to upgrade, Nuget Package Manager suggested the latest I could upgrade version 5.5.9 of iTextSharp to was 5.5.13. Would I be correct in thinking that the Nuget Package I need is entitled itext7 and is currently on version 7.1.2 along with itext7.pdfhtml version 2.0.2? — Matt Arnold, May 01 '18 at 09:57
I noticed you're the founder of iText from that link; just as a suggestion to avoid future instances of people using the wrong library I suggest it being noted on the iTextSharp Nuget Package's Description that it is now obsolete and which package to replace it with. — Matt Arnold, May 01 '18 at 10:00
The description of iText 5 on NuGet (https://www.nuget.org/packages/itextsharp) mentions iText 7. Concrete proposals to improve the text are welcome. — Amedee Van Gasse, May 18 '18 at 09:50
Thanks, I didn't notice the mention of iText 7 the first time I looked at the description. I suggest moving the line "iTextSharp is the .NET port of iText 5." into the short summary description seen immediately when selecting the package from NuGet Package Manager. I would also accompany it with "This package is now obsolete and has been superseded by the iText7 NuGet package." or alternative wording to indicate that updates will no longer be pushed to that package. — Matt Arnold, May 18 '18 at 11:40
