4

Is there a way to convert a Microsoft Word document to a string without using the Microsoft COM component? I am hoping there is some other way to deal with all of the excess markup.

EDIT 12/13/13: We didn't want to reference the COM component because it wouldn't work if the customer didn't have the exact same version of Office installed. Luckily, Microsoft has made the 2013 Word interop DLL (Microsoft.Office.Interop.Word.dll) backward compatible, so we no longer have to worry about this restriction. After referencing the DLL, we can do the following:

// Requires: using System; using System.IO; using Microsoft.Office.Interop.Word;
/// <summary>Gets the content of the word document</summary>
/// <param name="filePath">The path to the word document file</param>
/// <returns>The content of the document</returns>
public string ExtractText(string filePath)
{
    if (string.IsNullOrEmpty(filePath))
        throw new ArgumentNullException("filePath", "Input file path not specified.");

    if (!File.Exists(filePath))
        throw new FileNotFoundException("Input file not found at specified path.", filePath);

    var resultText = string.Empty;
    Application wordApp = null;

    try
    {
        wordApp = new Application();
        var doc = wordApp.Documents.Open(filePath, Type.Missing, true);
        if (doc != null)
        {
            if (doc.Content != null && !string.IsNullOrEmpty(doc.Content.Text))
              resultText = doc.Content.Text.Normalize();

            doc.Close();
        }
    }
    finally
    {
        if (wordApp != null)
            wordApp.Quit(false, Type.Missing, false);
    }

    return resultText;
}
Tresto
  • http://archworx.wordpress.com/2007/05/10/parsing-word-document-in-c/ google gave this to me :) – Volker Mauel Jan 05 '12 at 21:33
  • I can't imagine what you mean by translating a rich-text formatted document into a string. However, I guess you want to *somehow* access the plain text content. Hence see the [docx project at Codeplex](http://docx.codeplex.com/). – Ondrej Tucny Jan 05 '12 at 21:34
  • Isn't there a set of .NET Office extensions for things like this? – McKay Jan 05 '12 at 21:53
  • It looks like the poster wants `.DOC` format which is *hugely* different from `.DOCX` format. – Mike Christensen Jan 05 '12 at 21:54
  • @VolkerMauel he wants it without using the COM component. Your link is not appropriate – nawfal Jan 05 '12 at 21:58
  • The problem is that we don't want to require the server we install our software on to have Microsoft Office installed, and we don't want to reference (and package) a specific version of Office.Interop. – Tresto Jan 05 '12 at 22:20
  • Possible duplicate? [http://stackoverflow.com/questions/3755100/...](http://stackoverflow.com/questions/3755100/reading-doc-file-without-launching-msword) – Gabriel GM Jan 05 '12 at 23:45
  • @nawfal - no, it definitely wouldn't work. It was meant as a joke. – RQDQ Jan 06 '12 at 12:14
  • @RQDQ haha, I thought maybe with .NET 4 and Office 2010 combined there was such a neat trick, provided interop was used. :) – nawfal Jan 06 '12 at 12:41

3 Answers

2

You will need to use some library to achieve what you want:

If you have lots of time on your hands, then writing a .DOC parser might be feasible; the .DOC spec can be found here.

BTW: Office Interop is not supported by Microsoft in server-like scenarios (such as ASP.NET or a Windows Service); see http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2

Yahia
  • I would argue that writing a .DOC parser is not thinkable... Buying a component to do the conversion almost certainly has to be cheaper than the hours spent writing a converter. – RQDQ Jan 06 '12 at 12:15
  • @RQDQ I agree... although my experience on SO is that some people only accept free components, and I don't know of any for the old .DOC format... – Yahia Jan 06 '12 at 13:03
1

Assuming you mean extracting the text content of a DOC file, there are a few command-line tools as well as commercial libraries. A rather old tool that we once used to search DOC (not DOCX) files, in combination with the search engine Sphider, was catdoc (also here). It is a DOS tool rather than a Windows tool, but it worked for us as long as we met its prerequisites (8.3 file name format).

A commercial alternative is doc2txt, if you can afford $29.

For the newer docx format, you can use the Perl based tool docx2txt.

Of course, if you want to run those tools from C#, you need to start an external Process; check here for a solid explanation.
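As a rough illustration of starting an external process, here is a minimal sketch that runs a command-line converter and captures its standard output. The class name `ExternalTextExtractor`, the method `Run`, and the executable path are hypothetical, and it assumes the tool writes the extracted text to stdout (as catdoc does by default); check the documentation of whichever tool you pick.

```csharp
using System;
using System.Diagnostics;

public static class ExternalTextExtractor
{
    // Runs a command-line converter and returns whatever it writes to standard output.
    // exePath and the argument layout are assumptions; adjust them for the tool you use.
    public static string Run(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    exePath + " exited with code " + process.ExitCode);

            return output;
        }
    }
}
```

A call might look like `ExternalTextExtractor.Run(@"C:\tools\catdoc.exe", "REPORT.DOC")`, where the path is only a placeholder. Standard error is not captured in this sketch, so a failing tool only surfaces through its exit code.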

A rather expensive but very powerful tool for accessing DOC and DOCX content is Spire.Doc, although it does a lot more than you need. It is more convenient to use because it is a .NET library.

Olaf
0

If you are referring to the older DOC file format, then that is quite an issue because it is a Microsoft-specified binary file format, and I must say I totally agree with RQDQ's comment.

But if you are referring to the DOCX file format, then you can achieve this without the MS COM component or any other third-party component, using just pure .NET.

Check the following solutions:

http://www.codeproject.com/Articles/20529/Using-DocxToText-to-Extract-Text-from-DOCX-Files

http://www.dotnetspark.com/kb/Content.aspx?id=5633
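To illustrate the pure .NET idea, here is a minimal sketch that is not taken from the linked articles. A DOCX file is a ZIP archive whose main body lives in `word/document.xml`, so the sketch opens the archive and concatenates the `w:t` text runs, one line per `w:p` paragraph. The class name `DocxTextReader` is hypothetical. It assumes .NET 4.5 or later (for `ZipFile`, which on .NET Framework needs a reference to System.IO.Compression.FileSystem) and the usual transitional WordprocessingML namespace.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class DocxTextReader
{
    // WordprocessingML main namespace used by word/document.xml (transitional format assumed).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ExtractText(string filePath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(filePath))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a DOCX file?");

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);
                var text = new StringBuilder();

                // Emit one line per paragraph, joining the text of its runs.
                foreach (XElement paragraph in xml.Descendants(W + "p"))
                {
                    foreach (XElement run in paragraph.Descendants(W + "t"))
                        text.Append(run.Value);
                    text.AppendLine();
                }

                return text.ToString();
            }
        }
    }
}
```

This sketch ignores tabs, manual line breaks, headers, footers, and footnotes, and text inside text boxes may appear twice because those paragraphs are nested inside other paragraphs. It does not help with the binary DOC format.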

Mario Z