I'm not too familiar with the encoding that Microsoft Word uses. If someone where to save a .doc or .docx file from Word, what is the standard encoding that is used?
I'm guessing it's not UTF-8 as the resulting text (pasted in a UTF-8 encoded text file) does not honour certain punctuation (e.g quotes).
For example, an opening Word 'smart quote' when pasted in a UTF-8 text file, results in an ì
symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?
Edit
After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed format. Unzipping it results in a number of .xml files to be unpacked.
However, the inability for a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.