1

I'm creating a C# application that has to create a Word document.

I'm using Microsoft.Office.Interop.Word to do this and I've successfully managed to output some Word documents, but creating the content through code is very time-consuming work.

I noticed that Word is able to open HTML pages and show them as normal content, so I created a simple test table in HTML and inserted it into the Word document. But when I output the document the obvious happened: the tags were still there! Word did not interpret the tags as HTML. It just output exactly what I put in there.

How can I tell Word to render the text as HTML?

edit: (through the C# code of course)

edit 2: Please note that I'm parsing through some data to make this, so I will end up with about 4 pages of the same table/HTML, and I will need to be able to tell Word to start on the next page each time I've finished a loop. So an HTML-only method will probably not work.

Deduplicator
Pieter888
  • possible duplicate of [How to convert HTML file to word?](http://stackoverflow.com/questions/1624485/how-to-convert-html-file-to-word), read those answers there, they are providing alternative ways which will also work using C# – Doc Brown Apr 01 '11 at 13:38
  • this is not a duplicate... My question is clearly more detailed, I'm using C# and I'm not asking for a library to do this. – Pieter888 Apr 01 '11 at 13:42
  • your edit shows (more than before) that you should use a library for your task instead of going the HTML route. And there is a C# port of Apache POI available, which should solve your performance issues with Interop, look here http://stackoverflow.com/questions/2680546/where-to-get-apache-poi-port-for-net – Doc Brown Apr 01 '11 at 13:45

6 Answers

5

If you only want to output simple HTML content as a Word document, you could always cheat and write out the HTML content with a .doc extension.

Word will open that just fine.

If you need to add a page break, you can use a CSS page-break-before, like so:

<br style="page-break-before: always;"/>
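As a rough illustration of the cheat described above, the following C# sketch joins several HTML tables into one document, puts the page-break `<br>` between them, and writes the result with a .doc extension. The class name `HtmlDocWriter`, its method names, and the file name in the usage note are made up for this example; only the `page-break-before` style comes from the answer.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class HtmlDocWriter
{
    // Builds one HTML document in which every table after the first
    // is preceded by a CSS page break.
    public static string BuildHtml(IEnumerable<string> tablesHtml)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<html><body>");
        bool first = true;
        foreach (string table in tablesHtml)
        {
            if (!first)
            {
                sb.AppendLine("<br style=\"page-break-before: always;\"/>");
            }
            sb.AppendLine(table);
            first = false;
        }
        sb.AppendLine("</body></html>");
        return sb.ToString();
    }

    // Writes the HTML to disk; pass a path ending in .doc so Word opens it.
    public static void Save(string path, IEnumerable<string> tablesHtml)
    {
        File.WriteAllText(path, BuildHtml(tablesHtml), Encoding.UTF8);
    }
}
```

Calling `HtmlDocWriter.Save("report.doc", tables)` once per run, with one table string per loop iteration, should give one table per page when the file is printed or viewed in a print-oriented view.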

If you're set on using Interop, having read up a little bit, this post states that you need a converter to insert HTML, and the converters are only accessible when:

  • you paste HTML from the Clipboard
  • open/insert HTML from a file

So, this answer looks like it provides a clipboard-based solution: Adding html text to Word using Interop
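The "insert HTML from a file" route can be sketched with Interop roughly as follows. This is an untested outline that assumes Word is installed, the project references Microsoft.Office.Interop.Word, and each HTML fragment has already been written to its own file (use absolute paths). The class and method names are hypothetical; `Range.InsertFile`, `Range.InsertBreak`, and `Range.Collapse` are the Interop members it relies on, and `SaveAs2` assumes Word 2010 or later.

```csharp
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of any library.
public static class InteropHtmlInserter
{
    // Inserts each HTML file into a new document, starting a new page before
    // every file after the first, then saves the result to outputPath.
    public static void BuildDocument(IList<string> htmlFilePaths, string outputPath)
    {
        var app = new Word.Application();
        try
        {
            Word.Document doc = app.Documents.Add();
            object direction = Word.WdCollapseDirection.wdCollapseEnd;

            for (int i = 0; i < htmlFilePaths.Count; i++)
            {
                if (i > 0)
                {
                    // Move to the end of the document and add a page break.
                    Word.Range breakRange = doc.Content;
                    breakRange.Collapse(ref direction);
                    object breakType = Word.WdBreakType.wdPageBreak;
                    breakRange.InsertBreak(ref breakType);
                }

                // Re-read the end of the document, then let Word's converter
                // turn the HTML file into formatted content.
                Word.Range insertRange = doc.Content;
                insertRange.Collapse(ref direction);
                insertRange.InsertFile(htmlFilePaths[i]);
            }

            doc.SaveAs2(outputPath);
            doc.Close();
        }
        finally
        {
            app.Quit();
        }
    }
}
```

Because the HTML goes through Word's file converter here rather than being typed in as text, the tags should be rendered instead of shown literally.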

However, if there's any money to spend on the project, I can heartily recommend Aspose.Words which will do all of this for you.

Community
Town
  • Haha, nice, I did not know that, sweet cheat! But that doesn't solve my problem, because I need to output multiple pages and I can't tell Word through HTML to create a new page. – Pieter888 Apr 01 '11 at 13:34
  • I don't know about Word documents, but I've run into terrible trouble writing out HTML content and giving the file a .xls or .xlsx extension - Office 2007 gives a nice "The file you are trying to open .xlsx is in a different format than specified by the file extension" error, which often doesn't receive focus. – Ian Pugsley Apr 01 '11 at 13:37
  • Oh this answer worked just fine, but it's not exactly what I was looking for, because I need to be able to tell when to resume on a new page. – Pieter888 Apr 01 '11 at 13:39
  • @Pieter888: if dealing with pages is your only issue, take a look at http://www.w3.org/TR/CSS21/page.html. More specifically, `page-break-before:always` is already used by Word when you insert a page-break on a document and save it as HTML, so it should be able to understand it when opening a document ;) – Edurne Pascual Apr 01 '11 at 13:44
  • @herenvardo: It appears from [this question](http://stackoverflow.com/questions/4896863/insert-a-page-break-in-a-generated-html-doc) that it doesn't work, which is a shame! – Town Apr 01 '11 at 13:49
  • That `page-break-before:always` did it for me thank you very much, If you could post it in an answer I'd accept it to give you some points if you wish. – Pieter888 Apr 01 '11 at 13:52
  • @herenvardo: I (or, more specifically, the poster of that question) stand corrected! @Pieter888: good stuff, glad you got it sorted. – Town Apr 01 '11 at 13:54
  • @Pieter888: I think the question should be: Why the f*ck don't unicorns appear when it's not April 1st?! :D – Town Apr 01 '11 at 13:56
  • @town: just tested it, and it _does_ work, with just a caveat: Word defaults to opening html files on "web design" view, which doesn't _render_ page-breaks. Either printing or switching to "Print design" view properly shows the break. This snippet is enough to test it out: `<html>asdf<br style="page-break-before: always;">new page!</html>`. – Edurne Pascual Apr 01 '11 at 14:00
  • @herenvardo: That's good news - I hadn't tested it and was quite surprised when I read that other question stating that it *didn't* work. As you said, if Word uses it to denote a page break when it saves as HTML, it makes sense that it would use it to denote a page break when it loads HTML too. – Town Apr 01 '11 at 14:03
  • @town: conclusion: never trust the asker more than the actual program ;). It works, but by default you don't see it. – Edurne Pascual Apr 01 '11 at 14:14
  • @herenvardo: lol! Yes, true - never trust anything you read on the internet ;) – Town Apr 01 '11 at 14:15
1

Don't build the document in code; create it in Word as a template or mail merge template and then use code to merge or replace the field data.

See this answer: MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge

And see this from the mothership:

http://msdn.microsoft.com/en-us/library/ff433638.aspx

Community
Mesh
1

If you don't want to use an external library, Interop is too slow for you, and neither pure HTML nor a mail merge template is flexible enough, you could write your content as text or HTML into one or more files (using C#) and create a VBA macro in a Word document that creates a second Word document, reads the content files, and applies any formatting you want afterwards.

You can run this macro programmatically by starting Word using the command line switch /m.
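A minimal sketch of launching Word with the /m switch from C# might look like this. The class name, the macro name, and the document path are placeholders, and the code assumes `winword.exe` can be resolved by the shell on the machine running it.

```csharp
using System.Diagnostics;

// Hypothetical helper, not part of any library.
public static class WordMacroRunner
{
    // Builds the start info for: winword.exe "<documentPath>" /m<macroName>
    public static ProcessStartInfo BuildStartInfo(string documentPath, string macroName)
    {
        return new ProcessStartInfo
        {
            FileName = "winword.exe",
            Arguments = "\"" + documentPath + "\" /m" + macroName,
            UseShellExecute = true
        };
    }

    // Starts Word and blocks until the Word process exits.
    public static void Run(string documentPath, string macroName)
    {
        using (Process process = Process.Start(BuildStartInfo(documentPath, macroName)))
        {
            process?.WaitForExit();
        }
    }
}
```

Note that `Run` only returns once Word has closed, so the macro would need to quit Word itself when it is done.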

Doc Brown
1

Another possible approach: if your HTML is XHTML (i.e. XML-compliant), you could use XSLT to convert it to a Word XML format. But this would take a LOOOOOOOOOOONG time to code.

If you don't have to use HTML as the starting point, you could simply build the Word XML document yourself rather than using XSLT, which would be easier. It's time-consuming but possible; it's something I do quite a lot in my work.
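To give an idea of what building the Word XML yourself involves, here is a small sketch using `System.Xml.Linq`. It assumes the Word 2003 XML (WordprocessingML) flavour, which the answer does not specify, and emits plain paragraphs rather than tables to stay short. The class name `WordXmlBuilder` is made up for this example.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class WordXmlBuilder
{
    // Namespace of the Word 2003 XML format (assumed target format).
    private static readonly XNamespace W =
        "http://schemas.microsoft.com/office/word/2003/wordml";

    // Creates a document with one paragraph per entry, each on its own page.
    public static XDocument Build(IEnumerable<string> pages)
    {
        var body = new XElement(W + "body");
        bool first = true;
        foreach (string text in pages)
        {
            if (!first)
            {
                // A run containing <w:br w:type="page"/> forces a page break.
                body.Add(new XElement(W + "p",
                    new XElement(W + "r",
                        new XElement(W + "br", new XAttribute(W + "type", "page")))));
            }
            body.Add(new XElement(W + "p",
                new XElement(W + "r",
                    new XElement(W + "t", text))));
            first = false;
        }

        return new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            // Tells Windows to open the .xml file with Word.
            new XProcessingInstruction("mso-application", "progid=\"Word.Document\""),
            new XElement(W + "wordDocument",
                new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
                body));
    }
}
```

Something like `WordXmlBuilder.Build(new[] { "first page", "second page" }).Save("report.xml")` would then write a file that Word should be able to open directly.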

David
1

As requested by the OP, and to make it easier for others to find this solution, here is the answer I posted as a comment (plus extra results from testing):

When opening an HTML file, MS Word honors the CSS properties page-break-before and page-break-after. There is a caveat, however:

In "Web design" view, page breaks are never shown (this doesn't mean that they aren't there), just like browsers don't "show" them. And Word opens HTML files in Web design view by default (which makes sense). You need to print the document or switch to some other view (typically "Print design") to see your breaks in all their glory.

So, saving an HTML file with a .doc extension is a viable solution (also tested: Word opens it properly despite the extension).

Note: all the testing was done on MS Word 2003 using this snippet: <html>asdf<br style="page-break-before: always;">new page!</html>

Edurne Pascual
0

If a third-party component is an option, I would recommend the stuff from Aspose.
I have been pretty happy with their tools so far. The API is a little messy, but everything works as one would expect.

Matthew Whited