16

I want to save a Word document as HTML using Word Viewer, without having Word installed on my machine. Is there any way to accomplish this in C#?

Dan McClain
Pankaj
  • http://stackoverflow.com/questions/161791/word-97-2003-document-to-html-conversion-programatically-closed – Jørn Schou-Rode Feb 15 '10 at 13:21
  • Is this an exercise, or do you just want to translate from .doc to .html and the method doesn't really matter? – dnagirl Feb 15 '10 at 13:22
  • No, let me add some detail: MS Word is not installed on the client machine, so I have to do the job using only the Word Viewer component. – Pankaj Feb 15 '10 at 13:29

10 Answers

26

To convert a .docx file to HTML, you can use OpenXmlPowerTools. Make sure to add a reference to OpenXmlPowerTools.dll.

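using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

// DocxFilePath and HTMLFilePath are the input and output paths, defined elsewhere.
byte[] byteArray = File.ReadAllBytes(DocxFilePath);
using (MemoryStream memoryStream = new MemoryStream())
{
     // Work on an in-memory copy: the document is opened as editable (true),
     // and this keeps the file on disk untouched.
     memoryStream.Write(byteArray, 0, byteArray.Length);
     using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
     {
          HtmlConverterSettings settings = new HtmlConverterSettings()
          {
               PageTitle = "My Page Title"
          };
          XElement html = HtmlConverter.ConvertToHtml(doc, settings);

          // Write the generated XHTML tree to disk.
          File.WriteAllText(HTMLFilePath, html.ToStringNewLineOnAttributes());
     }
}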
gunr2171
  • 3
    PowerTools for Open XML just released a new HtmlConverter module that contains an open source, free implementation of a conversion from DOCX to HTML formatted with CSS. The module HtmlConverter.cs supports all paragraph, character, and table styles, fonts and text formatting, numbered and bulleted lists, images, and more. See http://bit.ly/1bclyg9 – Eric White Jan 31 '14 at 11:02
6

You can try Microsoft.Office.Interop.Word:

   using System;
   using Word = Microsoft.Office.Interop.Word;

    // Note: COM interop automates a real Word instance, so Word must be installed.
    public static void ConvertDocToHtml(object Sourcepath, object TargetPath)
    {
        object Unknown = Type.Missing;
        object noSave = Word.WdSaveOptions.wdDoNotSaveChanges;
        object format = Word.WdSaveFormat.wdFormatHTML;

        Word._Application newApp = new Word.Application();
        try
        {
            Word._Document od = newApp.Documents.Open(ref Sourcepath, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown,
                                     ref Unknown, ref Unknown, ref Unknown, ref Unknown);

            // Save the document that was just opened instead of relying on ActiveDocument.
            od.SaveAs(ref TargetPath, ref format,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown, ref Unknown,
                        ref Unknown, ref Unknown);

            od.Close(ref noSave, ref Unknown, ref Unknown);
        }
        finally
        {
            // Quit Word so that no WINWORD.EXE process is left running.
            newApp.Quit(ref noSave, ref Unknown, ref Unknown);
        }
    }
Bimzee
  • This one works very well for me. However, when saving the document it creates a "files" folder containing data that is not needed to open the resulting HTML. Any thoughts on that? – Ricker Silva Feb 07 '18 at 21:41
  • This works well for me. Be careful, though: if you are processing many documents, reuse a single instance of the Application object, since creating it wastes almost a full second each time. Besides, it is not easy to get rid of the process entirely, so the fewer instances the better for your RAM. – Ricker Silva Mar 09 '18 at 18:45
  • Plus, this requires Word to be installed, doesn't it? The OP specifically asked for a solution without Word. – Mircea Ion Mar 03 '20 at 20:01
3

I wrote Mammoth for .NET, which is a library that converts docx files to HTML, and is available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.
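
As a rough illustration, basic usage might look like the sketch below. It assumes the Mammoth NuGet package is referenced and that `DocumentConverter.ConvertToHtml` returns a result exposing `Value` (the HTML) and `Warnings`; check these names against the current README. The `MammothExample` class and `ConvertDocxToHtml` method are hypothetical wrappers added for the example.

```csharp
using System;
using Mammoth;

public static class MammothExample
{
    // Hypothetical helper: converts the .docx at docxPath and returns the HTML.
    public static string ConvertDocxToHtml(string docxPath)
    {
        var converter = new DocumentConverter();
        var result = converter.ConvertToHtml(docxPath);

        // Warnings report things such as Word styles with no HTML mapping.
        foreach (string warning in result.Warnings)
        {
            Console.Error.WriteLine(warning);
        }

        return result.Value;
    }
}
```

Because the conversion is driven by styles rather than appearance, a document that uses Word's built-in heading styles should come out as heading elements, while ad hoc bold or enlarged text is not treated as a heading.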

Michael Williamson
1

I think this will depend on the version of the Word document. If you have them in docx format, I believe they are stored within the file as XML data (but it is so long since I looked at the specification I am perfectly happy to be corrected on that).
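
To illustrate: a .docx file is a ZIP package whose main body conventionally lives in the entry `word/document.xml`, so its text can be read with nothing but the .NET base class library. The sketch below assumes that conventional entry name (strictly, the part is located through the package relationships) and extracts plain paragraph text only. It does not produce HTML and ignores styles, tables, and images. `DocxInspector` and `ReadParagraphs` are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxInspector
{
    // WordprocessingML namespace used by the elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the plain text of each paragraph in the main document part.
    public static string[] ReadParagraphs(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main part uses the conventional entry name.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");
            }

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);

                // A paragraph is w:p; its visible text sits in w:t descendants.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToArray();
            }
        }
    }
}
```

This only works for .docx. The older binary .doc format is not a ZIP archive and cannot be read this way.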

ZombieSheep
  • Correct, docx files are XML. The format differs from Word 2003 to 2007 and is a complete pain to work with! – roryf Feb 15 '10 at 13:20
  • Yes, rename the .docx extension to .zip and you can access all the files that make up the document. But without the full version of Word and COM interop, he's going to have a hard time trying to achieve his goal from the XML. +1 btw, as it's the only way he's going to do it without Word. – Bryan Feb 15 '10 at 13:21
  • Yes, MS Word is not installed on the client machine, so I have to do the job using only the Word Viewer component. – Pankaj Feb 15 '10 at 13:24
  • If it's stored in docx format you can open and manipulate the XML without using Word Viewer or COM interop; otherwise this cannot be done without Word. @Bryan FYI, docx 2003 isn't a zip archive, it's just an XML file with base64-encoded images. – roryf Feb 18 '10 at 14:05
  • @Rory Fitzpatrick, Try renaming a .docx to .zip and take a look for yourself. http://www.google.co.uk/search?q=.docx+rename+to+.zip – Bryan Feb 18 '10 at 16:57
  • @Bryan you're right, I was confusing .docx with Word 2003 .xml format – roryf Feb 18 '10 at 17:20
  • @Rory Fitzpatrick: Ah okay, fair enough. You're spot on about the XML though, as it's definitely in the .zip file. – Bryan Feb 18 '10 at 17:30
0

According to this Stack Overflow question, it isn't possible with Word Viewer. You will need Word itself in order to use COM interop.

Community
Bryan
  • Thanks for the reply. But I don't have MS Word installed on the machine, so I have to do this using Word Viewer only. – Pankaj Feb 15 '10 at 13:21
  • That's what I'm saying - I don't believe it is possible without the full version of Word. You could have a go using ZombieSheep's answer, but I doubt you will get very far TBH. It would make more sense to buy a copy of Word and use COM interop. – Bryan Feb 15 '10 at 13:23
  • Yes, MS Word is not installed on the client machine, so I have to do the job using only the Word Viewer component. – Pankaj Feb 15 '10 at 13:26
  • 1
    You *can't* do it with Word Viewer. Period. – Bryan Feb 15 '10 at 13:29
0

If you're open to not using C#, you could print to file using PrimoPDF (which would turn the .doc into a .pdf) and then use a PDF-to-HTML converter to go the rest of the way. After that you can edit the HTML however you like.

dnagirl
0

There is another similar topic, "Convert Word to HTML then render HTML on webpage", which you might find helpful if you are still working on this. There's a freely distributed DLL for this; I have given the link there.

Community
0

GemBox works pretty well. It even converts images in the Word doc to base64-encoded strings in img tags.

Mike W
-2

You will need to have MS Word installed to do this, I believe.

Check out this article for details on the implementation.

Tim S. Van Haren
  • Thanks for the reply. But I don't have MS Word installed on the machine, so I have to do this using Word Viewer only. – Pankaj Feb 15 '10 at 13:17
-2

Using the document conversion tools available in OpenOffice.org is probably the only possible option - the .doc format is only designed to be opened via Microsoft products so any libraries dealing with it will need to have reverse engineered the entire format.
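
As a hedged sketch of this route from C#: LibreOffice, a descendant of OpenOffice.org, ships a `soffice` executable that accepts `--headless --convert-to html --outdir <dir> <file>`. The code below assumes LibreOffice is installed and that the caller supplies the path to `soffice`. Older OpenOffice.org builds may not support these switches. `OfficeHtmlConverter`, `BuildArguments`, and `Convert` are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class OfficeHtmlConverter
{
    // Builds the command-line arguments for a headless conversion to HTML.
    public static string BuildArguments(string inputPath, string outputDir)
    {
        return string.Format(
            "--headless --convert-to html --outdir \"{0}\" \"{1}\"",
            outputDir, inputPath);
    }

    // Runs soffice and returns the path of the generated HTML file.
    public static string Convert(string sofficePath, string inputPath, string outputDir)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = sofficePath,
            Arguments = BuildArguments(inputPath, outputDir),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException("soffice exited with code " + process.ExitCode);
            }
        }

        // The output keeps the input's base name with an .html extension.
        string htmlPath = Path.Combine(
            outputDir, Path.GetFileNameWithoutExtension(inputPath) + ".html");
        if (!File.Exists(htmlPath))
        {
            throw new FileNotFoundException("Conversion produced no output file.", htmlPath);
        }
        return htmlPath;
    }
}
```

The output file is checked explicitly because a zero exit code alone is not treated here as proof that the conversion succeeded.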

ternaryOperator