21

I want to convert an HTML page to docx in C#. How can I do it?

Luis

9 Answers

18

My solution uses Html2OpenXml together with DocumentFormat.OpenXml (both are available as NuGet packages) to provide an elegant solution for ASP.NET MVC.

WordHelper.cs

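using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using HtmlToOpenXml; // namespace may differ by library version (older releases used NotesFor.HtmlToOpenXml)

public static class WordHelper
{
    // Converts an HTML string to the bytes of a .docx file, entirely in memory.
    public static byte[] HtmlToWord(String html)
    {
        using (MemoryStream generatedDocument = new MemoryStream())
        {
            using (WordprocessingDocument package = WordprocessingDocument.Create(
                   generatedDocument, WordprocessingDocumentType.Document))
            {
                // A newly created package has no main part yet, so add one with an empty body.
                MainDocumentPart mainPart = package.MainDocumentPart;
                if (mainPart == null)
                {
                    mainPart = package.AddMainDocumentPart();
                    new Document(new Body()).Save(mainPart);
                }

                HtmlConverter converter = new HtmlConverter(mainPart);
                Body body = mainPart.Document.Body;

                // Parse the HTML into OpenXML elements and append them to the body.
                var paragraphs = converter.Parse(html);
                for (int i = 0; i < paragraphs.Count; i++)
                {
                    body.Append(paragraphs[i]);
                }

                mainPart.Document.Save();
            }

            // The package must be disposed before the stream holds the complete file.
            return generatedDocument.ToArray();
        }
    }
}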

Controller

    [HttpPost]
    [ValidateInput(false)]
    public FileResult Demo(CkEditorViewModel viewModel)
    {
        return File(WordHelper.HtmlToWord(viewModel.CkEditorContent),
          "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
    }

I'm using CKEditor to generate HTML for this sample.

DavidC
Leonel Sanches da Silva
12

The code below does the same thing as Luis's code, but is a bit more readable and is adapted to an ASP.NET MVC application:

// Requires: using Microsoft.Office.Interop.Word; (for WdSaveFormat)
var word = new Microsoft.Office.Interop.Word.Application();
word.Visible = false;

var filePath = Server.MapPath("~/MyFiles/Html2PdfTest.html");
var savePathPdf = Server.MapPath("~/MyFiles/Html2PdfTest.pdf");
var wordDoc = word.Documents.Open(FileName: filePath, ReadOnly: false);
wordDoc.SaveAs2(FileName: savePathPdf, FileFormat: WdSaveFormat.wdFormatPDF);

// Close the document and quit Word so no background instance is left running.
wordDoc.Close();
word.Quit();

You can also save in other formats, such as docx, like this:

var savePathDocx = Server.MapPath("~/MyFiles/Html2PdfTest.docx");
var wordDoc = word.Documents.Open(FileName: filePath, ReadOnly: false);
wordDoc.SaveAs2(FileName: savePathDocx, FileFormat: WdSaveFormat.wdFormatXMLDocument);
PostureOfLearning
  • 2
    Remember to call `wordDoc.Close()` and `word.Quit()` to dispose of the objects afterward; otherwise you are left with instances of Word running in the background. – Dan Diplo Jul 26 '16 at 13:25
  • 5
    Note that using `Interop.Word.Application` in an ASP.NET application is officially **unsupported** and **not recommended** by Microsoft: https://stackoverflow.com/a/8709255/87698 – Heinzi Mar 29 '18 at 08:52
4

Use this code to convert:

// Requires a reference to Microsoft.Office.Interop.Word
// and: using Microsoft.Office.Interop.Word; (for WdSaveFormat)
Microsoft.Office.Interop.Word.Application word = 
    new Microsoft.Office.Interop.Word.Application();
word.Visible = false;
Object oMissing = System.Reflection.Missing.Value;
Object filepath = "c:\\page.html";
Object confirmconversion = System.Reflection.Missing.Value;
Object readOnly = false;
Object saveto = "c:\\doc.pdf";
Object oallowsubstitution = System.Reflection.Missing.Value;

// Open the HTML file directly; no blank document needs to be created first.
Microsoft.Office.Interop.Word.Document wordDoc = 
    word.Documents.Open(ref filepath, ref confirmconversion, 
    ref readOnly, ref oMissing,
    ref oMissing, ref oMissing, ref oMissing, ref oMissing,
    ref oMissing, ref oMissing, ref oMissing, ref oMissing,
    ref oMissing, ref oMissing, ref oMissing, ref oMissing);

// Saves as PDF; use WdSaveFormat.wdFormatXMLDocument and a .docx path for Word output.
object fileFormat = WdSaveFormat.wdFormatPDF;
wordDoc.SaveAs(ref saveto, ref fileFormat, ref oMissing, ref oMissing, ref oMissing,
    ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
    ref oMissing, ref oMissing, ref oMissing, ref oallowsubstitution, ref oMissing,
    ref oMissing);

// Close the document and quit Word so no background instance is left running.
wordDoc.Close(ref oMissing, ref oMissing, ref oMissing);
word.Quit(ref oMissing, ref oMissing, ref oMissing);
Mark Schultheiss
Luis
2

The OpenXML SDK allows you to programmatically build docx documents:

OpenXml SDK Download

Gibsnag
1

You might consider using altChunk. See, among others, the question "adding images to openxml doc created from altchunk".
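A minimal sketch of the altChunk approach, assuming the DocumentFormat.OpenXml NuGet package is referenced. The class name `AltChunkExample`, the method name `HtmlToDocx`, and the chunk id `htmlChunk1` are made up for illustration. It also assumes `html` is a complete HTML document that declares its own character encoding. Note that the HTML is only embedded in the package; Word performs the actual conversion when it opens the file.

```csharp
using System.IO;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class AltChunkExample   // hypothetical name, for illustration only
{
    public static byte[] HtmlToDocx(string html)
    {
        using (var stream = new MemoryStream())
        {
            using (var package = WordprocessingDocument.Create(
                       stream, WordprocessingDocumentType.Document))
            {
                // Create the main part with an empty body.
                MainDocumentPart mainPart = package.AddMainDocumentPart();
                mainPart.Document = new Document(new Body());

                // Store the raw HTML inside the package as an alternative-format part.
                const string altChunkId = "htmlChunk1"; // arbitrary relationship id
                AlternativeFormatImportPart chunk = mainPart.AddAlternativeFormatImportPart(
                    AlternativeFormatImportPartType.Html, altChunkId);
                using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
                {
                    chunk.FeedData(htmlStream);
                }

                // Reference the embedded HTML from the document body.
                mainPart.Document.Body.Append(new AltChunk { Id = altChunkId });
                mainPart.Document.Save();
            }

            // The package is disposed at this point, so the stream holds the full file.
            return stream.ToArray();
        }
    }
}
```

Because the conversion is deferred to Word, applications other than Word may not render the embedded HTML.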

If you don't want to rely on Word to convert the HTML, you could try docx4j-ImportXHTML for .NET; see this walkthrough.

Community
JasonPlutext
0

Using Office applications on a web server is not recommended by Microsoft. However, this can be done fairly easily using the OpenXML SDK 2.5.

All you really have to do is split the HTML on "<" and ">", then pass each part through a switch that identifies whether it is an HTML tag or not.

Then, for each part, you can start converting the HTML tags into "Run" and "RunProperties" elements, while the non-HTML text is simply placed into "Text" elements.

It sounds harder than it is... and yes, I have no idea why there isn't code available to do exactly this.

One thing to keep in mind: the two formats do not convert cleanly into each other, so if you focus on the cleanest code possible, you will run into issues where the format itself becomes messy.
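The split-and-switch idea described above can be sketched with nothing but the base class library. This is an illustrative sketch, not code from the answer: `TextRun` and `NaiveHtmlTokenizer` are hypothetical names, only bold and italic tags are recognised, and everything else is ignored. It yields plain (text, bold, italic) spans rather than OpenXML objects.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical intermediate type: a span of text plus the formatting in effect.
public sealed class TextRun
{
    public string Text { get; set; }
    public bool Bold { get; set; }
    public bool Italic { get; set; }
}

public static class NaiveHtmlTokenizer
{
    public static List<TextRun> ToRuns(string html)
    {
        var runs = new List<TextRun>();
        bool bold = false, italic = false;

        // The capturing group makes Regex.Split keep the tags in the result.
        foreach (string part in Regex.Split(html, "(<[^>]+>)"))
        {
            if (part.Length == 0) continue;

            if (part[0] == '<' && part[part.Length - 1] == '>')
            {
                // Tag name without brackets or attributes: "<B class=x>" becomes "b".
                string name = part.Trim('<', '>').Trim().Split(' ')[0].ToLowerInvariant();
                switch (name)
                {
                    case "b": case "strong": bold = true; break;
                    case "/b": case "/strong": bold = false; break;
                    case "i": case "em": italic = true; break;
                    case "/i": case "/em": italic = false; break;
                    default: break; // unknown tags are ignored in this sketch
                }
            }
            else
            {
                // Plain text: decode entities and record the current formatting.
                runs.Add(new TextRun
                {
                    Text = WebUtility.HtmlDecode(part),
                    Bold = bold,
                    Italic = italic
                });
            }
        }
        return runs;
    }
}
```

Each resulting span would then be mapped to a "Run" whose "RunProperties" carry the bold and italic flags and whose "Text" holds the string. Nested tags, attributes, tables, and malformed markup are exactly where this naive approach gets messy.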

0

Aspose.Words for .NET is a commercial component allowing you to achieve this.

Darin Dimitrov
  • A sample for [converting the HTML to Word document](http://www.aspose.com/documentation/.net-components/aspose.words-for-.net/aspose.words.loadformat.html) using Aspose.Words for .NET can also be viewed. – Shahzad Latif Aug 23 '11 at 16:57
  • 1
    I had some trouble with ASPOSE going from html to docx, like styling and image formatting issues that seemed pretty basic to me and they deemed them as product limitations... – Ariel May 24 '12 at 16:10
  • Agreed. The lack of support for css, even embedded css, means that you have to format all tables, paragraphs and even lists yourself. – nullnvoid Nov 18 '15 at 01:28
0

MigraDoc can help. Alternatively, use Visual Studio Tools for Office, or connect to Office via COM.

Sasha Reminnyi
-2

You may consider using PHPDocX, which offers a very convenient tool for converting HTML files and/or HTML strings into WordML.

It has plenty of options, among them:

  1. You can use CSS selectors to filter which chunks of HTML are inserted into the Word document.
  2. You can choose whether to download images or leave them as external links.
  3. It parses HTML forms.
  4. You can use native Word styles for tables and paragraphs, overriding the original CSS.
  5. It transforms HTML anchors into Word bookmarks.
  6. Etc.

I hope you find it useful :-)

Eduardo
  • 2
    This is not related to C# but instead PHP. Please respond to what the OP is asking. – Sha Dec 29 '20 at 11:07