
I'm trying to read the text of Word documents into a list of strings and then search that text for a word. The problem, however, is that the Word documents keep running in the Windows background after being opened, even though I close each document after reading its text.

Parallel.ForEach(files, file =>
{
    switch (System.IO.Path.GetExtension(file))
    {
        case ".docx":
            List<string> Word_list = GetTextFromWord(file);
            SearchForWordContent(Word_list, file);
            break;
    }
});

static List<string> GetTextFromWord(string direct)
{
    if (string.IsNullOrEmpty(direct))
    {
        throw new ArgumentNullException("direct");
    }

    if (!File.Exists(direct))
    {
        throw new FileNotFoundException("direct");
    }

    List<string> word_List = new List<string>();
    try
    {
        Microsoft.Office.Interop.Word.Application app =
            new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(direct);

        int count = doc.Words.Count;

        for (int i = 1; i <= count; i++)
        {
            word_List.Add(doc.Words[i].Text);
        }

        ((_Application)app).Quit();
    }
    catch (System.Runtime.InteropServices.COMException e)
    {
        Console.WriteLine("Error: " + e.Message.ToString());
    }
    return word_List;
}
Theodor Zoulias
  • AFAIK `Microsoft.Office.Interop` always runs Microsoft Word in the background. You should use something else if you don't want that to happen. To make sure it is closed, see this [QA](https://stackoverflow.com/a/6777522). If possible, you could use NPOI, DocumentFormat.OpenXML (for docx, xlsx, pptx, i.e. Open XML formats only), or some other alternative. Hope it helps. – Bagus Tesa Mar 18 '22 at 08:02
  • It's your own code that starts multiple instances of Word. When you use Word interop you actually start Word and use COM to talk to it. That's slow. Use a library to read/write Word files instead. `Parallel.ForEach` is misused too. It's only meant for *data* parallelism, not concurrent operations. You can use the [Office Open XML SDK](https://www.nuget.org/packages/DocumentFormat.OpenXml/) directly to read docx files, or use a library like [NPOI](https://github.com/nissl-lab/npoi/wiki/Getting-Started-with-NPOI) – Panagiotis Kanavos Mar 18 '22 at 08:04
  • When you use COM, every call, even reading a property, is a cross-process call to Word. Chatty code, including chained property calls, results in far more cross-process calls. A cross-process call is orders of magnitude slower than an in-memory call. If you can't get rid of Word, you'll have to write your code in a way that reduces calls, e.g. by caching objects. If you do that, you'll get better performance from a single thread than from 8 threads inefficiently calling 8 Word instances – Panagiotis Kanavos Mar 18 '22 at 08:08
  • [This SO answer](https://stackoverflow.com/questions/29586919/how-do-i-count-the-number-of-words-in-a-word-document-doc-docx-when-a-user) shows how to retrieve the word count using the [Open XML SDK](https://www.nuget.org/packages/DocumentFormat.OpenXml/), without using Word itself – Panagiotis Kanavos Mar 18 '22 at 08:23

1 Answer


When you use Word Interop you're actually starting the Word application and talking to it using COM. Every call, even reading a property, is an expensive cross-process call.

You can read a Word document without using Word. A docx document is a ZIP package containing well-defined XML files. You can read those files as XML directly, use the Open XML SDK to read a docx file, or use a library like NPOI, which simplifies working with Open XML.
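As a rough illustration of the "read the ZIP package directly" option, here is a minimal sketch that uses only the .NET base class library (`System.IO.Compression` and `System.Xml.Linq`), so Word is never started. The names `DocxText`, `ReadParagraphs` and `ContainsText` are made up for this example. It assumes the main document part is stored at `word/document.xml`, which is the usual location but is not guaranteed for every package, and it does a plain substring search rather than a whole-word match.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class DocxText
{
    // WordprocessingML namespace used by the elements in the main document part.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every paragraph (w:p) found in word/document.xml.
    public static List<string> ReadParagraphs(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            // Assumption: the main part lives at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("No word/document.xml part in " + path);
            }

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);

                // A paragraph's text is split across runs; join all w:t elements.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(
                              p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }

    // Case-insensitive substring search over the extracted paragraphs.
    public static bool ContainsText(IEnumerable<string> paragraphs, string text)
    {
        return paragraphs.Any(
            p => p.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

Because this only opens a file and parses XML, nothing is left running in the background once the `using` blocks finish. Headers, footers and footnotes are stored in separate parts and are not covered by this sketch.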

The word count is a property of the document itself. To read it using the Open XML SDK you need to check the document's ExtendedFileProperties part:

using (var document = WordprocessingDocument.Open(fileName, false))
{
  // Words.Text is a string, so it has to be parsed rather than cast
  var words = int.Parse(document.ExtendedFilePropertiesPart.Properties.Words.Text);
}

You'll find the Open XML documentation, including the structure of Word documents, at MSDN.

Avoiding Owner Files

Word or Excel files whose names start with ~ are owner files. These aren't real Word or Excel files. They're temporary files created when someone opens a document for editing, and they contain the logon name of that user. These files are deleted when Word closes gracefully but may be left behind if Word crashes or the user has no DELETE permissions, e.g. in a shared folder.

To avoid these one only needs to check whether the filename starts with ~.

  • If fileName is only the file name and extension, `fileName.StartsWith("~")` is enough
  • If fileName is an absolute path, use `Path.GetFileName(fileName).StartsWith("~")`

Things get trickier when trying to filter such files in a folder. The patterns used in Directory.EnumerateFiles or DirectoryInfo.EnumerateFiles are simplistic and can't exclude characters. The files will have to be filtered after the call to EnumerateFiles, eg :

var dir=new DirectoryInfo(folderPath);

foreach(var file in dir.EnumerateFiles("*.docx"))
{
    if (!file.Name.StartsWith("~"))
    {
        ...
    }
}

or, using LINQ :

var dir=new DirectoryInfo(folderPath);
var files=dir.EnumerateFiles("*.docx")
             .Where(file=>!file.Name.StartsWith("~"));
foreach(var file in files)
{
    ...
}

Enumeration can still fail if a file is opened exclusively for editing. To avoid exceptions, the EnumerationOptions.IgnoreInaccessible property can be set to skip over inaccessible files:

var dir=new DirectoryInfo(folderPath);
var options=new EnumerationOptions 
            { 
                IgnoreInaccessible =true
            };
var files=dir.EnumerateFiles("*.docx",options)
             .Where(file=>!file.Name.StartsWith("~"));
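
The filtering and the enumeration options can be wrapped in one small helper. This is only a sketch: `WordFileFinder` and `EnumerateDocx` are names invented for this example, and it assumes a runtime where `EnumerationOptions` is available (.NET Core 2.1 or later).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of any library.
static class WordFileFinder
{
    // Returns the full paths of the .docx files in a folder, skipping owner
    // files (names starting with "~") and entries that can't be accessed.
    public static IEnumerable<string> EnumerateDocx(string folderPath)
    {
        var options = new EnumerationOptions
        {
            IgnoreInaccessible = true,
            RecurseSubdirectories = false
        };

        return Directory.EnumerateFiles(folderPath, "*.docx", options)
                        .Where(path => !Path.GetFileName(path)
                                            .StartsWith("~", StringComparison.Ordinal));
    }
}
```

A caller can then loop with `foreach (var path in WordFileFinder.EnumerateDocx(folderPath))`. Note that a default `EnumerationOptions` instance also skips hidden and system files through its `AttributesToSkip` property. Owner files are often hidden, so they may already be excluded, but the explicit name check doesn't depend on that.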

Panagiotis Kanavos
  • Is there any way to check if the filename has a "~" in front? I'm using the Open XML SDK now, but there seems to be a problem. There's a FileFormatException when it opens Word documents whose filenames start with "~", which shouldn't be happening? – PohcbSonic Ziwen Mar 21 '22 at 02:52
  • Files that start with `~` aren't real Word or Excel files, they're lock files created when some other user has opened the file for editing. They're deleted when Word shuts down gracefully but will be left there if Word crashes, or if the user has no permission to delete files, eg in a shared folder. – Panagiotis Kanavos Mar 21 '22 at 07:33
  • As for checking, you can use `fileName.StartsWith('~')`. Is the real problem how to avoid such files when using `Directory.EnumerateFiles` perhaps? Unfortunately you can't exclude such files through a search pattern. You can use LINQ's `Where` to make filtering easier. – Panagiotis Kanavos Mar 21 '22 at 07:40
  • You can also use the overload that accepts [EnumerationOptions](https://learn.microsoft.com/en-us/dotnet/api/system.io.directory.enumeratefiles?view=net-6.0#system-io-directory-enumeratefiles(system-string-system-string-system-io-enumerationoptions)) with [IgnoreInaccessible](https://learn.microsoft.com/en-us/dotnet/api/system.io.enumerationoptions.ignoreinaccessible?view=net-6.0#system-io-enumerationoptions-ignoreinaccessible) – Panagiotis Kanavos Mar 21 '22 at 07:40