11

I have a tough project in my pipeline and I'm not sure where to begin. My boss wants the ability to display a Word document as HTML and have it look the same as the original Word document.

I have tried time after time to persuade him to let me show the Word document in a pop-up or a lightbox, but he is set on stripping out the contents of the Word document, converting them to HTML, saving that in a database, and then displaying it as HTML on a webpage.

Can you guys either give me some good ammo for why showing the Word document itself is better (less cumbersome, less storage space, more secure, etc.)?

Or, if it's pretty easy to convert a Word document to HTML, suggest ways for me to do that?

The technologies I currently have are Entity Framework, LINQ, MVC, C#, Razor.

We currently use HtmlAgilityPack, but this strips out all of the formatting, so the document doesn't display very well.

Dave Bish
James Wilson
  • You can use Task Scheduler, an MS Word macro (for saving .doc to .html), and a simple batch file (for xcopying the files to your IIS server) – BrOSs Aug 15 '13 at 16:32
  • Are the Word docs arbitrary or do they all follow a particular pattern? I saw you mention in another comment that all of the docs have images. I'm just curious if they all follow a particular pattern or template. – randcd Aug 15 '13 at 16:43
  • @randcd no pattern is followed. It's a bunch of how-to documents created by 10-30 different people. – James Wilson Aug 15 '13 at 16:46
  • Is editing the docx (or some rendition of it, e.g. HTML) in the browser a requirement? – JasonPlutext Aug 15 '13 at 21:57
  • @JasonPlutext editing it is not a requirement. They wish to only display parts of it to specific people and feel it is easier to achieve this through HTML so only displaying it in HTML is a requirement. – James Wilson Aug 15 '13 at 22:04

6 Answers

7

We use http://www.aspose.com/ (I think the product we use is Aspose.Words) to perform a similar task, and it works quite well. (There is a cost involved.)

I would suggest that converting to HTML gives the worst rendition of the document. One solution we use is to generate a JPEG image of the document and display that.

If you need to be able to perform operations like finding and copying/pasting text, I would recommend converting the document to a PDF and displaying it inline, in whichever standard PDF viewer the client machine has installed.

Dave Bish
7

If you are using DOCX you can always use the Open XML SDK from Microsoft; it's pretty easy to use and clean. A sample taken from MSDN:

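using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools; // HtmlConverter and HtmlConverterSettings come from Open XML PowerTools

// This example shows the simplest conversion. No images are converted.
// A cascading style sheet is not used.
byte[] byteArray = File.ReadAllBytes("Test.docx");
using (MemoryStream memoryStream = new MemoryStream())
{
    memoryStream.Write(byteArray, 0, byteArray.Length);
    using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))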
    {
        HtmlConverterSettings settings = new HtmlConverterSettings()
        {
            PageTitle = "My Page Title"
        };
        XElement html = HtmlConverter.ConvertToHtml(doc, settings);

        // Note: the XHTML returned by ConvertToHtmlTransform contains objects of type
        // XEntity. PtOpenXmlUtil.cs defines the XEntity class. See
        // http://blogs.msdn.com/ericwhite/archive/2010/01/21/writing-entity-references-using-linq-to-xml.aspx
        // for detailed explanation.
        //
        // If you further transform the XML tree returned by ConvertToHtmlTransform, you
        // must do it correctly, or entities do not serialize properly.

        File.WriteAllText("Test.html", html.ToStringNewLineOnAttributes());
    }
}

You might also want to take a look at Word Automation Services: http://blogs.office.com/b/microsoft-word/archive/2009/12/16/word-automation-services_3a00_-what-it-does.aspx

Gonzix
1

If your boss is dead set on displaying it as HTML, then getting the HTML generated from the Word doc into your database is the hardest part of the project.

You have a couple of workflows to choose from, but they go something like this:

  1. User saves the .doc as .html >> user uploads the HTML to the database through an app you create >> web app pulls the HTML from the database to display on the web page

  2. User saves the .doc >> user uploads the doc through an app you create >> the app converts the doc on the fly and then inserts the HTML into the database >> web app pulls the HTML from the database to display on the web page

  3. User saves and uploads the .doc file to the database >> web app pulls the doc and converts it on the fly when it's requested by a web page

  4. etc.
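
As a rough illustration, workflow 2 could be sketched like this. Everything here is hypothetical: `IWordToHtmlConverter`, `PlaceholderConverter`, `HtmlDocumentStore` and `UploadService` are names made up for the example, the placeholder does no real Word parsing, and the dictionary merely stands in for the database table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over whichever converter you end up choosing.
// This is not a type from any real library.
public interface IWordToHtmlConverter
{
    string ConvertToHtml(byte[] docxBytes);
}

// Placeholder so the sketch runs without third-party packages.
// It does NOT read Word files; swap in a real implementation.
public sealed class PlaceholderConverter : IWordToHtmlConverter
{
    public string ConvertToHtml(byte[] docxBytes)
    {
        return "<p>Converted " + docxBytes.Length + " bytes</p>";
    }
}

// Stand-in for the database table that holds the generated HTML.
public sealed class HtmlDocumentStore
{
    private readonly Dictionary<int, string> _htmlById = new Dictionary<int, string>();
    private int _nextId = 1;

    public int Save(string html)
    {
        int id = _nextId++;
        _htmlById[id] = html;
        return id;
    }

    public string Load(int id)
    {
        return _htmlById[id];
    }
}

public sealed class UploadService
{
    private readonly IWordToHtmlConverter _converter;
    private readonly HtmlDocumentStore _store;

    public UploadService(IWordToHtmlConverter converter, HtmlDocumentStore store)
    {
        _converter = converter;
        _store = store;
    }

    // Convert once at upload time, so page requests only read stored HTML.
    public int Upload(byte[] docxBytes)
    {
        if (docxBytes == null || docxBytes.Length == 0)
        {
            throw new ArgumentException("The uploaded document is empty.", "docxBytes");
        }

        string html = _converter.ConvertToHtml(docxBytes);
        return _store.Save(html);
    }
}

public static class Program
{
    public static void Main()
    {
        var store = new HtmlDocumentStore();
        var service = new UploadService(new PlaceholderConverter(), store);

        int id = service.Upload(new byte[] { 1, 2, 3 });
        Console.WriteLine(store.Load(id));
    }
}
```

Keeping the converter behind an interface means the choice between a paid library and a home-grown conversion stays a one-class change.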

Unfortunately, you're in for a bit of tomfoolery no matter which workflow you choose. @DaveBish suggested using a 3rd party tool, which I completely agree with as being the best way to handle the conversion (if you don't require your users to save their docs to HTML). Also, be aware that images in Word documents can be problematic when you've converted to HTML (they aren't preserved in the generated file, which means more /sarcasm/ fun for you on the web dev side).

If your boss doesn't want to foot the bill for a 3rd party converter, you can attempt to handle the conversion on your own with the Office.Interop namespace [insert blah about how this is a terrible idea blah blah]...in which case, this answer will probably be of great use to you.

Community
Daniel Szabo
0

You can also try Free Spire.Doc, which can convert Word documents to HTML.

0

I've used GemBox.Document; it can embed the images from the Word document within the HTML file itself.
For example, like this:

MemoryStream docxStream = null; // Replace null with a stream that contains your DOCX file.
DocxLoadOptions docxOptions = new DocxLoadOptions();

// Load DOCX file.
DocumentModel document = DocumentModel.Load(docxStream, docxOptions);

MemoryStream htmlStream = new MemoryStream();
HtmlSaveOptions htmlOptions = new HtmlSaveOptions();
htmlOptions.EmbedImages = true;
htmlOptions.HtmlType = HtmlType.HtmlInline;

// Save HTML file.
document.Save(htmlStream, htmlOptions);

Also, by using HtmlType.HtmlInline I get HTML content that can be placed on an existing page (like in a viewer or WYSIWYG editor). Check out the rest of the HtmlSaveOptions properties.

You can find more examples of this approach on Convert between Word and HTML and Word Editor in ASP.NET MVC.

hertzogth
0

This is an old post, but I just wrote an app that converts a Word-doc to a usable web-page. The app provides some of the requirements in the OP.

The app is WordWebNav (WWN). It's free and open-source.

WWN provides a Word VBA program that converts Word-docs to Word-HTML.

WWN also provides a Python program that converts the Word-HTML to a usable web-page:

  • It adds missing features to the Word-HTML, e.g., a navigation pane.
  • And, WWN fixes some common bugs in Word's HTML, e.g., mis-formatted lists, and overly-wide paragraphs.

The Python program uses a CLI, and it can be called externally.

JimYuill