Word Interop - Can you tell if a byte[] array of a Word Document is HTML?

Question

I'm working with a code base which, in a nutshell, is responsible for presenting documents in a web-based viewer, with a thumbnail for each page. Loading strategies and page-count calculations are separated per document type, and each strategy converts its documents into a common format for presentation.

The problem I'm working with concerns the initial page-count calculation for some Word documents. These documents are stored in a 3rd-party database which presents, amongst other things, a binary stream of the document and an extension (always 'doc'). To calculate the number of pages for the document, we use Microsoft Office Interop as follows:

    public int GetPageCount(byte[] file)
    {
        var filePath = Path.GetTempFileName();
        File.WriteAllBytes(filePath, file);

        return this.GetPageCount(filePath);
    }

    public int GetPageCount(string filePath)
    {
        try
        {
            this.OpenDocument(filePath);
            const WdStatistic statistic = Microsoft.Office.Interop.Word.WdStatistic.wdStatisticPages;
            var pages = Document.ComputeStatistics(statistic, Type.Missing);

            return pages;
        }
        finally
        {
            //Closes handles, removes temp files, implementation omitted for brevity
            this.DisposeDocument();
            this.DisposeApplication();
        }
    }

    private void OpenDocument(string filePath)
    {
        // Create a new Microsoft Word application object
        this.Word = new Application();
        this.Word.Visible = false;
        this.Word.ScreenUpdating = false;

        object refFilePath = filePath;

        // Declared but not passed to Documents.Open below
        object html = WdOpenFormat.wdOpenFormatWebPages;

        this.Document = this.Word.Documents.Open(ref refFilePath, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing);

        if (Document == null)
        {
            throw new Exception(string.Format("Could not open Word document ({0})", filePath));
        }
    }

The majority of documents processed by this code are normal Word documents, which work fine. However, some are actually HTML documents saved as Word documents, and for these the code using wdStatisticPages incorrectly reports only 1 page. I'm not sure whether something is missing from the existing code that would let the Interop library determine the page count for HTML correctly; fixing that does seem the simplest option.

As an alternative, I considered whether it might be possible to determine whether the byte array can parse to HTML; we have a rendering strategy for .html files but this isn't being used owing to the 'doc' strategy being inferred from the database. Converting the binary of the HTML documents into a string gives us the raw HTML and I wondered if something clever like a regex or a few of the 3rd party libraries out there might be viable. I'd have no trouble going with either, but I was wondering if there was something graceful in .NET that could do this a little better. It would be nice to not introduce a dependency or lean on a regex if something native to .NET were available. Something like:

    public bool IsHtml(byte[] file)
    {
        var fileString = Encoding.UTF8.GetString(file); 
        //Validate the fileString; how do we determine that the GetString() method correctly parsed and is not garbage?
        //return answer
    }
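
A minimal sketch of one dependency-free heuristic, under stated assumptions: a binary .doc file starts with the OLE compound file signature (D0 CF 11 E0 A1 B1 1A E1) and a .docx file starts with the ZIP signature (50 4B 03 04), so a 'doc' payload that starts with neither and has an `<html` tag near the top is probably raw HTML. The class name `DocumentSniffer`, the method name `LooksLikeHtml`, and the 1024-byte window are illustrative choices, not anything from the existing code base.

```csharp
using System;
using System.Text;

public static class DocumentSniffer
{
    // Signature of the OLE compound file container used by binary .doc files.
    private static readonly byte[] OleSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Signature of a ZIP archive, the container used by .docx files.
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (var i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Heuristic only: returns true when the bytes are not a Word container
    // and an <html tag (or an HTML doctype) appears near the start.
    public static bool LooksLikeHtml(byte[] file)
    {
        if (file == null || file.Length == 0) return false;

        // A genuine Word container cannot be raw HTML.
        if (StartsWith(file, OleSignature) || StartsWith(file, ZipSignature)) return false;

        // Assumption: the text is UTF-8 or ASCII-compatible; only the first
        // 1024 bytes are inspected (an arbitrary window).
        var head = Encoding.UTF8.GetString(file, 0, Math.Min(file.Length, 1024));

        // Drop a byte order mark and leading whitespace before matching.
        head = head.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');

        return head.StartsWith("<!DOCTYPE html", StringComparison.OrdinalIgnoreCase)
            || head.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

This does not validate the HTML; it only answers "is this plausibly HTML rather than a Word container?". HTML stored as UTF-16 would not match and would need a different decoding step.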

I should point out that an alternative option is to have the supplier of the 3rd-party database change their data to be more correct, e.g. store 'html' as the extension. But the curious soul in me wondered whether handling the discrepancy in code was actually possible and clean enough to warrant consideration. I did some research and searching on Stack Overflow but couldn't find anything relating to this query.

Thanks for any help and ideas. Please ask if you want any more information or details.

score 0 · Answer 1 · edited May 23 '17 at 12:16

In theory, you should be able to use the overloads of XDocument.Load() to attempt to load the file into an XML object, since HTML is XML, assuming it's valid HTML.

Really, most of the XML classes could be used to figure this out, especially if you already have the string; you just have to assume that invalid XML means it's actually a Word doc.
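
A minimal sketch of that idea, using XDocument.Parse on a string rather than XDocument.Load on a file. The names `XmlProbe` and `IsWellFormedXml` are made up for illustration. It only tests XML well-formedness, so HTML with unclosed tags is rejected even when it is valid HTML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlProbe
{
    // Returns true when the text parses as a well-formed XML document.
    // This is a well-formedness check, not an HTML validity check.
    public static bool IsWellFormedXml(string text)
    {
        try
        {
            XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for unclosed tags, mismatched tags, binary garbage, etc.
            return false;
        }
    }
}
```

For example, `XmlProbe.IsWellFormedXml("<html><body><p>Hi</p></body></html>")` returns true, while `XmlProbe.IsWellFormedXml("<html><body><p>Hi<br></p></body></html>")` returns false because the `<br>` tag is never closed.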

Edit: I now realize that newer Word formats are also XML, so this probably won't work. However, I believe you could apply a similar idea using HtmlAgilityPack.

Also see this thread for some ideas on various 3rd-party and .NET tricks that could be helpful: "What is the best way to parse html in C#?"

answered Oct 14 '16 at 17:53 by Paul Swetz


Use HtmlParser instead of XML; a simple error such as a tag that is never closed will blow up your XML parser. – mybirthname Oct 14 '16 at 17:57

Valid HTML is not guaranteed to be valid XML. In fact, probably the majority of valid HTML out there is not valid XML. – JLRishe Oct 14 '16 at 18:01

Thanks for the input; XML validation was an initial consideration, but it doesn't hold much practical weight. It looks like the options are a 3rd-party dependency to do the work, or fixing the root cause (the data!) to be correct :) – Gary Page Oct 21 '16 at 11:22
