9

Ugh. Word is notorious for its bloated, convoluted, non-standards-compliant, non-semantic HTML. Unfortunately, I have a professor who requires us to produce an outline to very exacting standards. I'd rather not hand-write it, so I decided to make something that would be useful for my classmates as well. I created the outline as a simple numbered list in NeoOffice on my Mac, exported it as HTML, and wrote quite a bit of CSS to style it. Then I got someone to create an ordered list in Word for Windows, export it as HTML, and send it to me to check compatibility. After scrolling miles down the page, trying to repress a shudder, I saw a problem: Word did not use <ol> and <li>. It used mountains of nested <span>s with classes out the wazoo. I hate to see all my work go to waste, but this content is impossible to work with; I'd have to style on a document-by-document basis rather than with a universal stylesheet.

Ideally, Word would generate HTML using standard tags so that I could style it just like any other list, but this doesn't seem to be the case. How can I make it generate lists that actually use <ol> and <li> rather than <span>, or at least modify something in my code to work with the weird way it does create lists?

Kara
Walker
  • Related: http://stackoverflow.com/questions/4824619/batch-conversion-of-docx-to-clean-html Might also help: http://stackoverflow.com/questions/1255738/tinymce-and-importing-copy-paste-from-microsoft-word – thirtydot Jan 30 '11 at 19:55
  • It isn't perfectly clear to me still what exactly you are trying to accomplish. – reisio Jan 30 '11 at 20:08
  • See the last lines. I clarified. – Walker Jan 30 '11 at 20:30
  • 1
    It's still not at all clear to me why you're using Word rather than a tool that generates standard HTML if standard HTML is what you want, but if that's really a requirement, this seems like a better question for SuperUser. – Chuck Jan 31 '11 at 18:04
  • 1
    Walker, it isn't clear why you need to use Word or create a list of items. Are you saying your professor has specifically said you must use Word, and a list? If not, please be clear what _is_ being asked of you. – reisio Feb 01 '11 at 00:01
  • I'd love nothing better than to just write some good old HTML, but I'm trying to make a more-or-less drag-and-drop system for non-technical users. They feel comfortable in nothing but Word, so I had wanted to use that as the list generator and then have them open the file in Notepad and paste the CSS between the style tags. I suppose it technically wouldn't be too hard to tie in a WYSIWYG editor as the generator instead, and accept the trade-off: they wouldn't have to see the CSS, but they would be forced to use a program on the internet *gasp* to create the original list. – Walker Feb 01 '11 at 05:24

9 Answers

4

The guys who wrote Winword and its HTML generation are smart guys. If it were easy to use HTML features in a purist way, they would have done so.

Word is about creating paper-optimised layouts. It supports concepts such as tab stops and multi-level numbering that HTML doesn't support, or is only just starting to. As a result, the HTML version of a Word document is not 'nice' HTML, but an attempt to retain the features of the Word document accurately.

When Word re-opens an HTML file it has saved, it does some clever reverse-engineering on the document, so that it renders in Word looking pretty much as it started. Equally, if you insert the HTML as a snippet into a web page, retaining Word's CSS, the results are pretty faithful. In this case there is a culture clash between the underlying CSS of the web page and Word's CSS, and some effort is required to make the best of a bad job. Word's HTML typically isn't saved as UTF-8 either (the usual default is a Windows code page such as Windows-1252), which needs some handling.
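
The encoding point can be handled before any other clean-up. The following is a minimal sketch rather than a tested pipeline: the function name `word_html_to_utf8` is hypothetical, and the Windows-1252 fallback is an assumption about how the file was saved, used only when no charset declaration is found.

```python
import re

def word_html_to_utf8(raw: bytes, fallback: str = "windows-1252") -> str:
    """Decode Word-exported HTML bytes and relabel the declared charset as UTF-8."""
    # Look for a charset declaration near the top of the file.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r"charset=[\"']?([\w-]+)", head, flags=re.IGNORECASE)
    encoding = match.group(1) if match else fallback  # fallback is an assumption
    try:
        text = raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong label: decode with the fallback and keep going.
        text = raw.decode(fallback, errors="replace")
    # Rewrite the first charset declaration, since the caller will save as UTF-8.
    return re.sub(r"(charset=[\"']?)[\w-]+", r"\1utf-8", text,
                  count=1, flags=re.IGNORECASE)
```

The returned string can then be written back out with `encoding="utf-8"`, so that curly quotes and accented characters survive being pasted into a UTF-8 page.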

HTML Tidy can be used to rip out Word mark-up, but some more massaging is required after this for good rendering within a web page. I have worked on a product for 15 years which does this mixing of Word and web pages, and the results can be quite good if you fine-tune the CSS.
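
To give a feel for what "ripping out Word mark-up" involves, here is a rough standard-library sketch. It is not HTML Tidy and does far less than Tidy does; the function name `strip_word_markup` is hypothetical, and the patterns only cover a few constructs that commonly appear in Word exports (conditional comments, `<o:p>`-style namespaced tags, inline `style` attributes, and `Mso…` classes).

```python
import re

def strip_word_markup(html: str) -> str:
    """Remove a few common Word-specific constructs from exported HTML."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Office-namespaced tags such as <o:p> and </o:p>
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.IGNORECASE)
    # Inline style attributes, quoted with either ' or "
    html = re.sub(r"\s+style=(\"[^\"]*\"|'[^']*')", "", html, flags=re.IGNORECASE)
    # class attributes whose value starts with Mso (quoted or bare)
    html = re.sub(r"\s+class=[\"']?Mso\w+[\"']?", "", html, flags=re.IGNORECASE)
    return html
```

Regular expressions are a blunt tool for HTML, so this is best treated as a first pass before a real parser or Tidy, with the CSS fine-tuning still done by hand afterwards.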

We used Word because we are creating paper-versions, and importing text from reports written in Word, not because we couldn't find a dedicated HTML editor.

I would not recommend using Word to create tidy purist HTML. You wouldn't use a can-opener to open a bottle of wine, would you?

Life would be much simpler if: a) Microsoft re-engineered the myriad options in its highly confusing 'Bullets and Numbering' feature, and b) HTML provided native, properly featured multi-level numbering support, instead of the afterthought approaches currently available. The weakness of HTML in this area can be seen in the flimsy numbering options available in Google Docs.

So much has improved with HTML 5, maybe we can hope that HTML 6 will help bridge the word processor / HTML editor divide.

Herc
1

Use this resource http://word2cleanhtml.com/ to convert Word documents to clean HTML. Very useful, in my opinion.

Tural Ali
0

You can link an external stylesheet to an HTML document in Word under the Developer tab -> Document Template -> Linked CSS. You can then use this to override almost any style generated by Word.

Credit: https://superuser.com/questions/65107/how-to-apply-external-css-stylesheet-to-document-in-microsoft-word/65144#65144

Note: I did this using Word 2013, but it is not a new feature.

Community
Droj
0

If you can get your hands on a Windows PC, use Notepad++ (http://notepad-plus-plus.org/) to paste the code, and then select the plugin to format the code.

Teknophilia
  • Is this a plugin that comes standard with Notepad++? I use TextMate on the Mac, so I'm wondering if there would be an equivalent for what you're suggesting. I'm not sure it would address the problem of Word's mangled, non-semantic markup, though? – Walker Feb 01 '11 at 05:25
  • I believe so. Go to Menu > TextFX > HTML Tidy > Tidy: Reindent XML. As for what Word is doing, you might just have to copy everything into Notepad to lose the extra formatting code that Word adds, and then paste that into Notepad++, where you can reformat it. I would then recommend that you take notes using Notepad++. – Teknophilia Feb 01 '11 at 22:14
  • 1
    I just found out about some alternatives to Notepad++ for Macs. You have TextWrangler (http://www.barebones.com/products/textwrangler/), gedit (http://projects.gnome.org/gedit/), and Macpad (http://sourceforge.net/projects/macpp/). Macpad says it's Notepad++ for Macs, so it might have HTML Tidy as well. – Teknophilia Feb 01 '11 at 22:19
0

Use a WYSIWYG editor as the list generator. This would remove the need for the users to deal with raw CSS, at the cost of taking them out of the comfort zone of Microsoft Word.

Walker
0

Creative use of Word's Find and Replace might also work. For example, open the HTML file with Notepad, then copy and paste the text back into a Word document. Open Find and Replace. Suppose the HTML looks like this, with "This is the first line of text" being the first list item:

<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span...(Cut for brevity)...
-height:115%'>This is the first line of text<o:p></o:p></span></p>

Then, with wildcards turned on, find \<p*line-height:115%'\> and replace it with nothing. It may take a series of find/replace passes. The HTML markup is copious but, all else being equal, it is at least consistent.
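
For anyone who would rather script this than click through Find and Replace, the same idea can be sketched with regular expressions. This is only an illustration under assumptions: the names `PREFIX`, `SUFFIX`, and `strip_list_wrappers` are hypothetical, and the patterns are tied to the sample above, so a different document (for example one with a different line-height) would need different patterns.

```python
import re

# Non-greedy match from "<p" up to the first "line-height:115%'>" after it,
# mirroring the wildcard search described above.
PREFIX = re.compile(r"<p\b.*?line-height:115%'>", flags=re.DOTALL)
# Closing residue seen at the end of the sample paragraph.
SUFFIX = re.compile(r"<o:p></o:p></span></p>")

def strip_list_wrappers(html: str) -> str:
    """Delete the Word wrapper around each list paragraph, leaving the item text."""
    html = PREFIX.sub("", html)
    return SUFFIX.sub("", html)
```

As with the manual approach, expect to need several passes with slightly different patterns before only the item text is left.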

ForEachLoop
0

If you've got Dreamweaver handy, there is a magic "Clean Up Word HTML" command that does wonders in this scenario.

Wyatt Barnett
0

MSWord is only as smart as the author: an ordered list is converted to an HTML list only if it was created as a list in MSWord. In other words, the list must be built with MSWord's own list constructs, not just made to look like a list on the page. Many people create lists that "appear" to be ordered or unordered by using tabs and other formatting rather than MSWord's list functions. Saving to HTML tries to preserve the document as it was written, not as it was displayed.
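
A quick way to see which kind of markup a given export actually contains is to count the tags. The sketch below uses Python's standard `html.parser`; the names `ListTagCounter` and `summarize_lists` are hypothetical, and the `MsoListParagraph` class prefix is taken from the sample quoted elsewhere on this page, so other Word versions may use different class names.

```python
from html.parser import HTMLParser

class ListTagCounter(HTMLParser):
    """Count semantic list tags versus Word-style list paragraphs."""

    def __init__(self):
        super().__init__()
        self.semantic = 0        # <ol>, <ul>, and <li> start tags seen
        self.mso_paragraphs = 0  # <p class=MsoListParagraph...> start tags seen

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul", "li"):
            self.semantic += 1
        elif tag == "p":
            css_class = dict(attrs).get("class") or ""
            if css_class.startswith("MsoListParagraph"):
                self.mso_paragraphs += 1

def summarize_lists(html: str) -> tuple:
    """Return (semantic list tag count, Word list paragraph count)."""
    counter = ListTagCounter()
    counter.feed(html)
    counter.close()
    return counter.semantic, counter.mso_paragraphs
```

If the first number is zero and the second is not, the export is using styled paragraphs rather than real list elements, and a universal list stylesheet will not apply to it.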

StrangeDucks
0

From doing some research, it appears that converting the document to HTML isn't practical. Word is simply too variable in how it saves files and generates HTML, even for a single document, not to mention the differences among versions of Word. As with Wyatt's suggestion, there may be ways to clean up the code, but none of them are perfect. Digging around the API may provide a way to parse this more easily, but it may turn out to be just as convoluted in practice. Using Word as a list-generation tool simply seems unrealistic.

Walker