Skip adding empty tables to PDF when parsing XHTML using ITextSharp

Question

iTextSharp throws an error when you attempt to create a PdfPTable with 0 columns.

I need to generate a PDF from XHTML that is produced by an XSLT transformation, and I am currently using iTextSharp to do so. The problem is that the generated XHTML sometimes contains tables with 0 rows, so when iTextSharp attempts to parse one into a table it throws an error saying there are 0 columns in the table.

It reports 0 columns because iTextSharp sets the number of columns in the table to the maximum number of columns found in any row, and since there are no rows that maximum is 0.
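To illustrate that reasoning, here is a minimal sketch in plain C#. It is not iTextSharp's actual code; `ColumnCountSketch` and `ColumnCount` are hypothetical names, and each row is modelled as a simple array of cell strings.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ColumnCountSketch
{
    // Hypothetical helper: the widest row determines the column count.
    // DefaultIfEmpty(0) makes the result 0 when there are no rows at all,
    // which is the situation that leads to the "0 columns" error.
    public static int ColumnCount(IEnumerable<string[]> rows)
    {
        return rows.Select(r => r.Length).DefaultIfEmpty(0).Max();
    }
}
```

With one row of two cells this returns 2; with an empty list of rows it returns 0.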

How do I go about catching these HTML table declarations with 0 rows and stop them from being parsed into PDF elements?

I've found the piece of code that is causing the error is within the HtmlPipeline, so I could copy and paste the implementation into a class extending HtmlPipeline and overriding its methods and then do my logic to check for empty tables there, but that seems sloppy and inefficient.

Is there a way to catch the empty table before it is parsed?

Solution

The Tag Processor

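using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.tool.xml;
using iTextSharp.tool.xml.html.table;

//Skips <table> tags that produced no content; otherwise defers to the stock Table processor
public class EmptyTableTagProcessor : Table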
{
    public override IList<IElement> End(IWorkerContext ctx, Tag tag, IList<IElement> currentContent)
    {
        if (currentContent.Count > 0)
        {
            return base.End(ctx, tag, currentContent);
        }

        return new List<IElement>();
    }
}

And using the Tag Processor...

        //CSS
        var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);

        //HTML
        var fontProvider = new XMLWorkerFontProvider();
        var cssAppliers = new CssAppliersImpl(fontProvider);

        var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
        tagProcessorFactory.AddProcessor(new EmptyTableTagProcessor(), new string[] { "table" });

        var htmlContext = new HtmlPipelineContext(cssAppliers);
        htmlContext.SetTagFactory(tagProcessorFactory);

        //PIPELINE
        var pipeline =
            new CssResolverPipeline(cssResolver,
            new HtmlPipeline(htmlContext,
            new PdfWriterPipeline(document, pdfWriter)));

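        //XML WORKER
        var xmlWorker = new XMLWorker(pipeline, true);

        //XML PARSER (the worker is registered as the parser's listener)
        var xmlParser = new XMLParser();
        xmlParser.AddListener(xmlWorker);

        using (var stringReader = new StringReader(html))
        {
            xmlParser.Parse(stringReader);
        }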

This solution removes the empty table tags and still writes the PDF as a part of the pipeline.
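A different approach, not taken from the posts in this thread, is to strip rowless tables from the XHTML before it ever reaches the parser. The sketch below uses only `System.Xml.Linq`. `EmptyTableFilter` and `RemoveEmptyTables` are hypothetical names, and it assumes the input is well-formed XML with a single root element, which XSLT output normally is. It treats any `table` element with no `tr` descendant as empty.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class EmptyTableFilter
{
    // Hypothetical helper: returns the XHTML with every rowless <table> removed.
    public static string RemoveEmptyTables(string xhtml)
    {
        var doc = XDocument.Parse(xhtml);

        // Match on LocalName so the check also works when the document
        // declares the XHTML namespace. ToList() snapshots the matches
        // so the tree can be modified safely afterwards.
        var emptyTables = doc.Descendants()
            .Where(e => e.Name.LocalName == "table"
                     && !e.Descendants().Any(d => d.Name.LocalName == "tr"))
            .ToList();

        foreach (var table in emptyTables)
        {
            table.Remove();
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string could then be handed to the `StringReader` in place of the original `html`. The tag processor above does not need this step; it is only an option if you would rather clean the markup up front.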

Chris Haas · Accepted Answer · 2014-07-01T16:59:39.147

You should be able to write your own tag processor that accounts for that scenario by subclassing iTextSharp.tool.xml.html.AbstractTagProcessor. In fact, to make your life even easier you can subclass the already existing more specific iTextSharp.tool.xml.html.table.Table:

public class TableTagProcessor : iTextSharp.tool.xml.html.table.Table {

    public override IList<IElement> End(IWorkerContext ctx, Tag tag, IList<IElement> currentContent) {
        //See if we've got anything to work with
        if (currentContent.Count > 0) {
            //If so, let our parent class worry about it
            return base.End(ctx, tag, currentContent);
        }

        //Otherwise return an empty list which should make everyone happy
        return new List<IElement>();
    }
}

Unfortunately, if you want to use a custom tag processor you can't use the shortcut XMLWorkerHelper class and instead you'll need to parse the HTML into elements and add them to your document. To do that you'll need an instance of iTextSharp.tool.xml.IElementHandler which you can create like:

public class SampleHandler : iTextSharp.tool.xml.IElementHandler {
    //Generic list of elements
    public List<IElement> elements = new List<IElement>();
    //Add the supplied item to the list
    public void Add(IWritable w) {
        if (w is WritableElement) {
            elements.AddRange(((WritableElement)w).Elements());
        }
    }
}

You can use the above with the following code which includes some sample invalid HTML.

//Hold everything in memory
using (var ms = new MemoryStream()) {

    //Create new PDF document 
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            doc.Open();

            //Sample HTML
            string html = "<table><tr><td>Hello</td></tr></table><table></table>";

            //Create an instance of our element helper
            var XhtmlHelper = new SampleHandler();

            //Begin pipeline
            var htmlContext = new HtmlPipelineContext(null);

            //Get the default tag processor
            var tagFactory = iTextSharp.tool.xml.html.Tags.GetHtmlTagProcessorFactory();

            //Add an instance of our new processor
            tagFactory.AddProcessor(new TableTagProcessor(), new string[] { "table" });

            //Bind the above to the HTML context part of the pipeline
            htmlContext.SetTagFactory(tagFactory);

            //Get the default CSS handler and create some boilerplate pipeline stuff
            var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(false);
            //The ElementHandlerPipeline at the end is where we add our IElementHandler
            var pipeline = new CssResolverPipeline(cssResolver,
                new HtmlPipeline(htmlContext,
                new ElementHandlerPipeline(XhtmlHelper, null)));

            //The worker dispatches commands to the pipeline stuff above
            var worker = new XMLWorker(pipeline, true);

            //Create a parser with the worker listed as the dispatcher
            var parser = new XMLParser();
            parser.AddListener(worker);

            //Finally, parse our HTML directly.
            using (TextReader sr = new StringReader(html)) {
                parser.Parse(sr);
            }

            //The above did not touch our document. Instead, all "proper" elements are stored in our helper class XhtmlHelper
            foreach (var element in XhtmlHelper.elements) {
                //Add these to the main document
                doc.Add(element);
            }

            doc.Close();

        }
    }
}

I'm getting an error stating that 'This document has no pages' when I run the example you posted without using the TableTagProcessor. Even when I use just simple valid HTML. But when I do use the TableTagProcessor without the ElementHandler I don't get the error? It seems to be working though without the element handler. — Rovert Renchirk, Jul 01 '14 at 16:45
You will get the first exception if you pass *only* an invalid table in. You can trick the system by always adding a paragraph with a space in it at the end. I think you can also inspect `writer.PageEmpty`. If I comment out the `AddProcessor` line and use _everything else exactly as is including the HTML string_ I still get your original exception. Without seeing how you converted the above to not use `IElementHandler` I can't speak to the other comment. — Chris Haas, Jul 01 '14 at 17:15
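
The workaround mentioned in the comment above (adding a paragraph that contains a single space) can be sketched roughly as follows. `EmptyDocumentGuard` and `AddAllOrPlaceholder` are hypothetical names, not part of iTextSharp. As a variation on the comment, this sketch adds the placeholder only when the parser produced no elements at all, and it relies on the comment's claim that such a paragraph is enough to avoid the "no pages" error.

```csharp
using System.Collections.Generic;
using iTextSharp.text;

public static class EmptyDocumentGuard
{
    // Hypothetical helper: adds the parsed elements to the document and,
    // if there were none, adds a single-space paragraph instead.
    public static void AddAllOrPlaceholder(Document doc, IList<IElement> elements)
    {
        foreach (var element in elements)
        {
            doc.Add(element);
        }

        if (elements.Count == 0)
        {
            // Assumption taken from the comment above: a paragraph holding
            // one space gives the document something to write.
            doc.Add(new Paragraph(" "));
        }
    }
}
```

In the answer's example this would stand in for the `foreach` loop, called as `EmptyDocumentGuard.AddAllOrPlaceholder(doc, XhtmlHelper.elements);` just before `doc.Close()`.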
