I'm working with a code base which, in a nutshell, is responsible for presenting documents in a web-based viewer, with thumbnails for each page. Loading strategies and page-count calculation are separated per document type, and each strategy converts its documents into a common format for presentation.
The problem I'm working on concerns the initial page-count calculation for some Word documents. These documents are stored in a 3rd-party database which exposes, amongst other things, a binary stream of the document and an extension (always 'doc'). To calculate the number of pages in a document, we use Microsoft Office Interop as follows:
public int GetPageCount(byte[] file)
{
    var filePath = Path.GetTempFileName();
    File.WriteAllBytes(filePath, file);
    return this.GetPageCount(filePath);
}

public int GetPageCount(string filePath)
{
    try
    {
        this.OpenDocument(filePath);
        const WdStatistic statistic = Microsoft.Office.Interop.Word.WdStatistic.wdStatisticPages;
        var pages = Document.ComputeStatistics(statistic, Type.Missing);
        return pages;
    }
    finally
    {
        //Closes handles, removes temp files, implementation omitted for brevity
        this.DisposeDocument();
        this.DisposeApplication();
    }
}

private void OpenDocument(string filePath)
{
    // Create a new Microsoft Word application object
    this.Word = new Application();
    this.Word.Visible = false;
    this.Word.ScreenUpdating = false;
    object refFilePath = filePath;
    object html = WdOpenFormat.wdOpenFormatWebPages;
    this.Document = this.Word.Documents.Open(
        ref refFilePath, ref this.missing, ref this.missing, ref this.missing,
        ref this.missing, ref this.missing, ref this.missing, ref this.missing,
        ref this.missing, ref this.missing, ref this.missing, ref this.missing,
        ref this.missing, ref this.missing, ref this.missing, ref this.missing);
    if (Document == null)
    {
        throw new Exception(string.Format("Could not open Word document ({0})", filePath));
    }
}
The majority of documents processed by this code are normal Word documents, which work fine. However, some of them are actually HTML documents saved as Word documents and, unfortunately, this code using wdStatisticPages incorrectly reports that these documents have only 1 page. I'm not sure whether something is missing from the existing code that would let the Interop library determine the page count for HTML correctly; this does seem the simplest option.
As an alternative, I considered whether it might be possible to determine whether the byte array parses as HTML; we have a rendering strategy for .html files, but it isn't being used because the 'doc' strategy is inferred from the database. Converting the binary of the HTML documents into a string gives us the raw HTML, and I wondered if something clever like a regex or one of the 3rd-party libraries out there might be viable. I'd have no trouble going with either, but I was wondering if there was something graceful in .NET that could do this a little better. It would be nice not to introduce a dependency or lean on a regex if something native to .NET were available. Something like:
public bool IsHtml(byte[] file)
{
    var fileString = Encoding.UTF8.GetString(file);
    //Validate the fileString; how do we determine that the GetString() method correctly parsed and is not garbage?
    //return answer
}
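To make the idea concrete, here is a minimal sketch of what such a check might look like using only the base class library. It leans on the fact that genuine binary .doc files are OLE compound files, which always begin with the signature bytes D0 CF 11 E0 A1 B1 1A E1. Everything else is an assumption for illustration: that the HTML is stored in an ASCII-compatible encoding (UTF-8 or Windows-1252, say), that an `<html` tag appears within the first 4096 bytes, and the names `DocumentSniffer`, `IsBinaryWordDocument` and `LooksLikeHtml`, which are hypothetical and not part of the existing code base.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration only; not part of the existing code base.
public static class DocumentSniffer
{
    // Signature found at the start of every OLE compound file,
    // the container format used by binary .doc files.
    private static readonly byte[] OleSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Assumption: inspecting the first 4096 bytes is enough to find the <html tag.
    private const int SniffLength = 4096;

    public static bool IsBinaryWordDocument(byte[] file)
    {
        return file != null
            && file.Length >= OleSignature.Length
            && file.Take(OleSignature.Length).SequenceEqual(OleSignature);
    }

    public static bool LooksLikeHtml(byte[] file)
    {
        if (file == null || file.Length == 0 || IsBinaryWordDocument(file))
        {
            return false;
        }

        // Lenient decode: invalid bytes become U+FFFD rather than throwing,
        // which is fine because only ASCII markers are searched for.
        var head = Encoding.UTF8.GetString(file, 0, Math.Min(file.Length, SniffLength));

        // NUL characters suggest binary data (or UTF-16, which this sketch does not handle).
        if (head.IndexOf('\0') >= 0)
        {
            return false;
        }

        // Skip a byte order mark and leading whitespace, then look for an <html tag.
        head = head.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');
        return head.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With something along these lines, the 'doc' strategy could call `DocumentSniffer.LooksLikeHtml(file)` first and hand the bytes to the existing .html rendering strategy when it returns true. This is a heuristic rather than validation: it does not prove the markup is well formed, only that the content is not a binary Word file and carries an HTML root tag near the start.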
I should point out that an alternative option is to have the supplier of the 3rd-party database correct their data, e.g. store 'html' as the extension for these documents. But the curious soul in me wondered whether handling the discrepancy in code was actually possible and clean enough to warrant consideration. I did some research and searched StackOverflow but couldn't quite find anything relating to this query.
Thanks for any help and ideas. Please ask if you want any more information or details.
… or … without being closed will blow up your XML parser. – mybirthname Oct 14 '16 at 17:57