5

I'm trying to open a .doc file and read its content, but I can't find any way to do this without launching MS Word.

I currently have the following code:

Microsoft.Office.Interop.Word.Application app = new Microsoft.Office.Interop.Word.Application();
object nullObject = System.Reflection.Missing.Value;
object file = @"C:\doc.doc";
Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(ref file, ref nullObject, ref nullObject,
         ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject,
         ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject, ref nullObject,
         ref nullObject);
doc.ActiveWindow.Selection.WholeStory();
doc.ActiveWindow.Selection.Copy();
IDataObject data = Clipboard.GetDataObject();
string text = data.GetData(DataFormats.Text).ToString();
doc.Close(ref nullObject, ref nullObject, ref nullObject);
app.Quit(ref nullObject, ref nullObject, ref nullObject);

But it launches MS Word. Is there any way to do this without launching it?

Alexis Pigeon
Vitali Fokin

3 Answers

3

Two possibilities: either use Microsoft's spec to write your own parser for the .doc format, or use an existing library for the purpose (e.g., from Aspose). Unless you have a couple of spare years to spend on the task, the latter is clearly the correct choice.

Jerry Coffin
1

Add a reference to the library via Add Reference --> Browse --> Code7248.word_reader.dll.

Download the DLL from this URL:

sourceforge.net/p/word-reader/wiki/Home

(A simple .NET Library compatible with .NET 2.0, 3.0, 3.5 and 4.0 for C#. It can currently extract only the raw text from a .doc or .docx file.)

Sample code for a simple C# console application:

using System;
using System.Collections.Generic;
using System.Text;
//add extra namespaces
using Code7248.word_reader;


namespace testWordRead
{
    class Program
    {
        private void readFileContent(string path)
        {
            TextExtractor extractor = new TextExtractor(path);
            string text = extractor.ExtractText();
            Console.WriteLine(text);
        }
        static void Main(string[] args)
        {
            Program cs = new Program();
            // Verbatim string (@) so the backslashes are not treated as escape sequences
            string path = @"D:\Test\testdoc1.docx";
            cs.readFileContent(path);
            Console.ReadLine();
        }
    }
}

It is working fine.
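
If the files are .docx rather than the binary .doc format, a rough alternative with no third-party DLL is to read the package directly: a .docx file is a ZIP archive whose main text lives in word/document.xml. The sketch below assumes .NET 4.5 or later (for System.IO.Compression.ZipArchive); the class name DocxText and the method name Extract are made up for illustration. It only collects the text runs of each paragraph and will not work on .doc files.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library mentioned in this thread.
static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(Stream docx)
    {
        // leaveOpen: true so the caller keeps ownership of the stream
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);

                // One output line per <w:p>, built from its <w:t> text runs
                var paragraphs = doc.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }

    public static string Extract(string path)
    {
        using (FileStream fs = File.OpenRead(path))
            return Extract(fs);
    }
}
```

Calling DocxText.Extract(@"D:\Test\testdoc1.docx") would return the paragraph text. Headers, footers, footnotes and text boxes are not handled properly here, so treat it as a starting point only.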

1

Last time I did this (via COM from C++), I recall a 'Visible' property in the Application interface (true=visible).

However, it seems to me that the default was false, so you had to set it to true to make Word appear.

Regardless of whether or not the user can see Word, you will still see winword.exe (or whatever it's called today) in your task manager. I don't think there's a way to access Word through this interface without it launching Word (behind the scenes or not).

If you don't want Word to launch at all, you may have to find another solution.

Marc Bernier
  • Visibility is enabled by default, so I can see MS Word anyway. Even if I set visibility to false, the window appears and quickly collapses. – Vitali Fokin Sep 20 '10 at 21:02
  • I need to process lots of .doc files, and it takes too much time to launch Word every time. – Vitali Fokin Sep 20 '10 at 21:04
  • Strange about the visibility property. I am using an older version of Office (2003); maybe they changed the default. COM is very slow, but you may be able to reuse some of the objects; I think the application object can stay alive as you cycle through each document. It may help a little. – Marc Bernier Sep 29 '10 at 17:50
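
The suggestion in the last comment, keeping one Application object alive while cycling through the documents, might look roughly like the sketch below. It still starts winword.exe once, so it does not avoid launching Word; it only avoids paying the startup cost per file. The class name WordBatchReader and the method name ReadAll are invented for illustration, and the code assumes the same Microsoft.Office.Interop.Word reference as the question. It reads doc.Content.Text instead of going through the clipboard, and passes false for the Visible parameter of Documents.Open; whether that fully suppresses the window may depend on the Office version.

```csharp
using System;
using System.Collections.Generic;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper: one Word instance is shared by all documents.
static class WordBatchReader
{
    public static Dictionary<string, string> ReadAll(IEnumerable<string> paths)
    {
        var texts = new Dictionary<string, string>();
        object missing = System.Reflection.Missing.Value;
        object readOnly = true;
        object addToRecent = false;
        object visible = false;
        object noSave = Word.WdSaveOptions.wdDoNotSaveChanges;

        // Started once, outside the loop
        Word.Application app = new Word.Application();
        try
        {
            app.Visible = false;
            foreach (string path in paths)
            {
                object file = path;
                // Arguments 3, 4 and 12 are ReadOnly, AddToRecentFiles and Visible
                Word.Document doc = app.Documents.Open(ref file, ref missing, ref readOnly,
                    ref addToRecent, ref missing, ref missing, ref missing, ref missing,
                    ref missing, ref missing, ref missing, ref visible, ref missing,
                    ref missing, ref missing, ref missing);
                try
                {
                    // Read the document body directly, without the clipboard
                    texts[path] = doc.Content.Text;
                }
                finally
                {
                    doc.Close(ref noSave, ref missing, ref missing);
                }
            }
        }
        finally
        {
            // Quit once, after every document has been processed
            app.Quit(ref noSave, ref missing, ref missing);
        }
        return texts;
    }
}
```

Usage would be something like WordBatchReader.ReadAll(Directory.GetFiles(@"C:\docs", "*.doc")), where the folder path is only a placeholder and Directory needs using System.IO. The try/finally blocks are there so that a failure on one file does not leave a hidden winword.exe running.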