15

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.

I think it would help to explain what that entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy them over to Dreamweaver, fix some formatting, and put the result onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, add the header and footer from the webpage, and save the result. I originally planned to have it process files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert them.

I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.

http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190

This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.

http://openxmldeveloper.org/articles/333.aspx

This is what I started using. XSLT has the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from that, and applies the DocX2Html.xsl file I scavenged from OpenXML Viewer. I believe that stylesheet was originally provided by MS for SharePoint servers so they could render Word documents in a browser, or something along those lines.
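The "decompress and grab document.xml" step needs very little code, because a docx file is an ordinary ZIP package. Here is a minimal sketch in Python (standard library only, chosen to keep the example dependency-free; it is not the C# code linked in this question). The function name `read_document_xml` is hypothetical, and the part path `word/document.xml` is the conventional location, which this sketch assumes rather than resolving through the package relationships.

```python
import io
import zipfile

# Conventional location of the main document part inside a .docx package.
# Assumption: the package follows the usual layout; a strict reader would
# look the part up through _rels/.rels instead.
DOCUMENT_PART = "word/document.xml"


def read_document_xml(docx):
    """Return the main document part (WordprocessingML) as bytes.

    `docx` may be a file path or the raw bytes of a .docx file.
    """
    source = io.BytesIO(docx) if isinstance(docx, bytes) else docx
    with zipfile.ZipFile(source) as package:
        return package.read(DOCUMENT_PART)
```

The returned bytes can then be handed to whatever does the transformation, be it an XSLT processor or a hand-written converter, with no Word instance involved.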

After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML markup. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).

Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.
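To illustrate how small a "tables and text formatting only" converter can be, here is a rough sketch in Python (standard library only) that walks the WordprocessingML directly instead of using XSLT. It is not a replacement for the MS stylesheet: it assumes only paragraphs, bold/italic runs, and tables matter, and it ignores styles, lists, images, and everything else. All function names are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _flag_on(run, name):
    """True if run property w:<name> (e.g. w:b) is present and not switched off."""
    prop = run.find(f"{W}rPr/{W}{name}")
    return prop is not None and prop.get(f"{W}val", "true") not in ("0", "false", "off")


def _run_html(run):
    """Escape the text of one run and wrap it in <em>/<strong> as needed."""
    text = html.escape("".join(t.text or "" for t in run.iter(f"{W}t")))
    if not text:
        return ""
    if _flag_on(run, "i"):
        text = f"<em>{text}</em>"
    if _flag_on(run, "b"):
        text = f"<strong>{text}</strong>"
    return text


def _block_html(el):
    """Convert one block-level element (paragraph or table) to HTML."""
    if el.tag == f"{W}p":
        return "<p>" + "".join(_run_html(r) for r in el.iter(f"{W}r")) + "</p>"
    if el.tag == f"{W}tbl":
        rows = []
        for tr in el.findall(f"{W}tr"):
            cells = "".join(
                "<td>" + "".join(_block_html(child) for child in tc) + "</td>"
                for tc in tr.findall(f"{W}tc")
            )
            rows.append(f"<tr>{cells}</tr>")
        return "<table>" + "".join(rows) + "</table>"
    return ""  # section properties, cell properties, etc. are dropped


def document_xml_to_html(xml_bytes):
    """Convert the contents of document.xml to a bare HTML fragment."""
    body = ET.fromstring(xml_bytes).find(f"{W}body")
    return "\n".join(filter(None, (_block_html(el) for el in body)))
```

Because the output contains only `p`, `em`, `strong`, and table tags, styling is left entirely to the site's own CSS, which is the opposite trade-off from a stylesheet that tries to reproduce Word's layout.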

Paul Roub
Omega192
  • You say at the beginning that this is a process that you're doing manually, but then you're not happy with the memory-intensive Word automation solution. Why? If you're not selling this as a commercial product why does the efficacy of the solution matter? You're turning a laborious manual process into an automated one, who cares if it takes a minute per document - it's still going to be miles quicker. – Keith Jan 28 '11 at 08:55
  • True, I'm not selling it as a commercial product. However, I intend to share it with my coworkers, and I'd prefer to offer them an efficient program. My personal computer may be modern and up to specs to handle such things, but I have no idea about how theirs will handle it. Another issue is the dependency on Word. I'm assuming they all own a copy, but that's just an assumption. I'd like to offer them a program that will work efficiently regardless of what computer they run it on. – Omega192 Jan 28 '11 at 19:55

3 Answers

6

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx

The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml

Alec Gorge
  • Oh wow. I really don't know how I never came across this in my searching. I guess I was looking for docx to html rather than Open XML/WordprocessingML to XHTML. I haven't had the chance to implement this yet, but it looks like exactly what I'm looking for. Thank you very much! :D – Omega192 Jan 28 '11 at 20:09
  • Excellent! Once you get this program done I am sure many people would love to hear if this works out. Maybe once you get the program done you could post the source code somewhere or something. Good luck! – Alec Gorge Jan 29 '11 at 14:07
  • Drat, could have sworn I posted an update. For some reason HtmlConverterSettings and HtmlConverter are giving me errors about a missing assembly. I've referenced all four assemblies that the first link tells me to, except that OpenXmlPowerTools is actually OpenXml.PowerTools when I import the .dll. I contacted Eric White about it, but I haven't heard from him since I replied to his original reply. – Omega192 Apr 06 '11 at 07:40
  • I've done a full adaptation and implementation of this project...it's great stuff. Did you ever get yours sorted out? – Chris B. Behrens Jul 08 '11 at 16:30
  • @ChrisB.Behrens Drat, just now saw your comment. I could never solve that missing assembly issue so I gave up on a pretty solution and opened the gateway to the underworld and did this with RegEx. The input is fairly controlled, so it worked wonders. – Omega192 Apr 23 '12 at 06:52
  • @ChrisB.Behrens Hey Chris. Do you happen to have the source code of your implementation? I'm trying to do the same thing, and it would help me a lot. – Ansh Saini Jan 24 '23 at 11:35
2

Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:

  1. Open the Word document with Aspose.Words.
  2. Save the Word document as HTML.
  3. Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.

Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
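Step 3 above (removing unwanted tags and attributes) can be sketched independently of whichever tool produces the HTML. The example below uses Python's standard `html.parser` as a stand-in for SgmlReader or HTML Agility Pack, which are .NET libraries and are not shown here. The whitelist contents and the names `HtmlCleaner` and `clean_html` are assumptions for illustration; adjust them to the markup your site actually needs.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags and attributes worth keeping in a report page.
KEEP_TAGS = {
    "p", "b", "i", "u", "strong", "em", "br", "a",
    "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li",
    "table", "thead", "tbody", "tr", "th", "td",
}
KEEP_ATTRS = {"href", "colspan", "rowspan"}
# Elements whose text content should vanish along with the tag.
DROP_CONTENT = {"style", "script", "title"}


class HtmlCleaner(HTMLParser):
    """Re-emit only whitelisted tags/attributes; keep text, drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skipping = 0  # depth inside style/script/title

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skipping += 1
        elif tag in KEEP_TAGS:
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or "", quote=True))
                for name, value in attrs
                if name in KEEP_ATTRS
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skipping = max(0, self._skipping - 1)
        elif tag in KEEP_TAGS and tag != "br":  # br is a void element
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skipping:
            self.parts.append(escape(data, quote=False))


def clean_html(markup):
    """Return `markup` with non-whitelisted tags and attributes removed."""
    cleaner = HtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.parts)
```

A whitelist is used rather than a blacklist because converter output tends to contain many vendor-specific tags and attributes, and it is easier to enumerate what should survive than what should not. Comments, including Word's conditional comments, are dropped because the parser's default comment handler does nothing.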

Uwe Keim
  • Yeah, I did come across some commercial solutions, though not Aspose.Words specifically. This is just a personal project I'm doing to help my coworkers and myself out, so I'm relying on my college student budget of $0 :P I appreciate your time to answer my question though, thank you! – Omega192 Jan 28 '11 at 19:57
2

Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.

I am a web developer who had the same issues, so I created my own tool: http://www.convertwordtohtml.com

We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be converted directly to HTML, with the code placed on the clipboard. The current version also supports command-line access, and the new version will have a server version too.

There is a free trial version downloadable from the site, and if you have any questions do contact me any time.

  • I'm fairly certain it is perfectly acceptable to do that. It appears that you have made a very nice piece of software, unfortunately I don't have money to buy a license. Thank you for your post, though! – Omega192 Feb 22 '11 at 04:05