I am parsing an HTML string with iTextSharp XMLWorker in my WPF application, using the code below:

var css = "";
using (var htmlMS = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(html)))
{                    
    //Create a stream to read our CSS
    using (var cssMS = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(css)))
    {                        
        //Get an instance of the generic XMLWorker
        var xmlWorker = XMLWorkerHelper.GetInstance();

        //Parse our HTML using everything setup above
        xmlWorker.ParseXHtml(writer, doc, htmlMS, cssMS, System.Text.Encoding.UTF8, fontProv);                        
    }
}

The parsing works, but it is really slow: it takes around 2 seconds to parse the HTML, so a 50-page PDF takes around 2 minutes. I am using inline styling in my HTML string. Is this the expected behaviour, or can it be optimized?

nvoigt
user2877090
  • Is your HTML deeply nested? For instance, is everything wrapped in a giant DIV? In those cases the parser (and even regular desktop browsers) has to get all the way to the end of the document before it can render the first thing. Are you using tables? PDFs don't have a concept of tables, so iText has to simulate them, which can be computationally expensive if they're long. Are you using images? If so, iText has to load/download the images (depending on how they're referenced), which will also take time. – Chris Haas Jan 22 '14 at 16:00
  • I don't have an answer yet, but I'm seeing how incredibly slow this library is, too. Most of the time is eaten up in the following method internal to ParseXHtml: iTextSharp.text.FontFactoryImp.RegisterDirectories – Josh Mouch Jun 20 '14 at 14:48
  • I'm finding the XMLWorkerHelper instance super slow only when running my application in debug mode. – John K Sep 29 '14 at 18:01
  • See also Java iText issue - http://stackoverflow.com/q/15621218/179972 – John K Sep 29 '14 at 18:19

1 Answer

The question assumes that the HTML parsing is what slows everything down. That's not true: the bottleneck occurs even before the first snippet of HTML is parsed.
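One way to check where the time goes is to time the setup phase and the parse phase separately. The following is a minimal sketch using only the Java standard library; `PhaseTimer`, `timeMillis` and `pretendWork` are hypothetical names, and the sleeps are stand-ins for the real setup and parse calls, not measurements of iText.

```java
class PhaseTimer {
    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /** Hypothetical stand-in for real work: sleeps for the given number of milliseconds. */
    static void pretendWork(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Phase 1: everything that happens before parsing (for example, font setup).
        long setupMs = timeMillis(() -> pretendWork(50));
        // Phase 2: the parse call itself.
        long parseMs = timeMillis(() -> pretendWork(5));
        System.out.println("setup: " + setupMs + " ms, parse: " + parseMs + " ms");
    }
}
```

In a real application the two lambdas would wrap the actual setup code and the actual parse call; if the first number dominates, tuning the HTML will not help much.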

You are using the most basic handful of lines of code to create your PDF from HTML as demonstrated in the ParseHtml example:

public void createPdf(String file) throws IOException, DocumentException {
    // step 1
    Document document = new Document();
    // step 2
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    // step 3
    document.open();
    // step 4
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(HTML));
    // step 5
    document.close();
}

This code is simple, but it performs a lot of operations internally as explained in the comments of this other question: XMLWorkerHelper performance slow.

Registering font directories consumes plenty of time. You can avoid this by using your own FontProvider, as is done in the ParseHtmlFonts example:

public void createPdf(String file) throws IOException, DocumentException {
    // step 1
    Document document = new Document();

    // step 2
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    writer.setInitialLeading(12.5f);

    // step 3
    document.open();

    // step 4

    // CSS
    CSSResolver cssResolver = new StyleAttrCSSResolver();
    CssFile cssFile = XMLWorkerHelper.getCSS(new FileInputStream(CSS));
    cssResolver.addCss(cssFile);

    // HTML
    XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
    fontProvider.register("resources/fonts/Cardo-Regular.ttf");
    fontProvider.register("resources/fonts/Cardo-Bold.ttf");
    fontProvider.register("resources/fonts/Cardo-Italic.ttf");
    fontProvider.addFontSubstitute("lowagie", "cardo");
    CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());

    // Pipelines
    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

    // XML Worker
    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream(HTML));

    // step 5
    document.close();
}

In this case, we pass DONTLOOKFORFONTS so that iText does not go looking for fonts, which saves an enormous amount of time. Instead of having iText look for fonts, we tell it which fonts we're going to use in the HTML.
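When many documents are produced in one run, it may also help to build the font provider once and reuse it. This goes beyond what the answer states and is an assumption: whether a provider instance can safely be shared should be checked against the iText version in use. The sketch below shows only the generic build-once pattern in plain Java; the `Lazy` class is a hypothetical helper, not part of iText.

```java
import java.util.function.Supplier;

/** Hypothetical helper: builds a value on first use and returns the same instance afterwards. */
final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private volatile T value;

    Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            // Double-checked locking: only one thread runs the expensive factory.
            synchronized (this) {
                result = value;
                if (result == null) {
                    // Note: a factory that returns null would be invoked again on the next call.
                    result = factory.get();
                    value = result;
                }
            }
        }
        return result;
    }
}
```

The expensive construction (for example, creating the provider and registering the three font files) would go inside the supplier passed to `Lazy`, and each conversion would call `get()` instead of constructing a new provider.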

Bruno Lowagie
  • Thank you for providing a definitive answer from the source - I will start broadening my code base accordingly. – John K Oct 06 '14 at 17:30