I am trying to convert a Word document (.doc) into plain text because I want to run regular expressions on it to find things in the document. The code below reads the Word document and appends it to a rich text box, but that does not give me plain text. When I tried writing to a regular text document instead, every word ended up on a new line. I have not been able to find any information on how to do this in C#. I'm using C# and Visual Studio 2010.

I do not expect any special formatting in the document (like bold, underlines, etc.), but if someone knows how I can be robust and handle that too, that would be super awesome.

I want it as a text document because there are several methods I know I can use on regular text, but I doubt they would work on Word text due to the hidden/special characters that come with Word docs.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using Microsoft.Office.Interop.Word;

namespace ReadWordDocProject
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            string testFile = @"C:\Users\<mycomputer>\Documents\TestItemHelpers\TestWordDoc.docx";

            Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
            Document document = application.Documents.Open(testFile);//path here

            int count = document.Words.Count;
            for (int i = 1; i <= count; i++)
            {
                string text = document.Words[i].Text;
                //Do output with text here
                richTextBox1.AppendText(text);
            }

            ((_Application)application).Quit(); //cast as _Application because there's ambiguity 
        }


    }
}
I'm with Monica
user3003304
  • "When I tried with regular text document it printed every word on its a new line" What was the code you tried here? – Ben Aaronson May 15 '14 at 19:38
  • As a non-programming solution, have you tried copying the entire document contents from within Word and pasting them into a text editor? If this is just a one-off task, that's surely the quickest route to a text-only document. – adv12 May 15 '14 at 19:41
  • I'll have a lot of files like this coming in and it seems a little impractical to do it by hand. I know how to do it by hand, but I was hoping for an easier solution. – user3003304 May 15 '14 at 20:14
  • @BenAaronson I did a write line by line to a text doc just to test and see if it would work. Do you think some special characters in word doc could have translated a text equivalent line to a word doc's single word?... – user3003304 May 15 '14 at 20:19

1 Answer

Microsoft says you shouldn't use Microsoft Office Interop to manipulate documents from an automated (unattended, server-side) application.

You can use a free library like Spire.Doc to convert the Word document to TXT, then open the .txt file. I think there is a way to save directly to a MemoryStream from Spire, but I'm not sure. (I know there is in Aspose.Words, but that isn't free.)

//Requires: using System.IO; using Spire.Doc;
private void button1_Click(object sender, EventArgs e)
{
    //Open word document
    Document document = new Document();
    string docPath = @"C:\Users\<computer name>\Documents\TestItemHelpers";

    document.LoadFromFile(Path.Combine(docPath, "TestWordDoc.docx"));

    //Save a plain-text copy next to the original.
    document.SaveToFile(Path.Combine(docPath, "TestTxt.txt"), FileFormat.Txt);

    //Read the plain text back in.
    string readText = File.ReadAllText(Path.Combine(docPath, "TestTxt.txt"));

    //do regex here
}
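
As a rough illustration of the "do regex here" step, here is a minimal sketch that runs a pattern over the text once it is in a plain string. The class name, the FindDates helper, the date pattern, and the sample text are all made up for the example; substitute whatever you actually need to find.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSketch
{
    // Hypothetical helper: returns every date-like token (e.g. 5/15/2014) in the text.
    public static string[] FindDates(string text)
    {
        MatchCollection matches = Regex.Matches(text, @"\b\d{1,2}/\d{1,2}/\d{4}\b");
        string[] results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Value;
        }
        return results;
    }

    static void Main()
    {
        // Stand-in for the readText variable from the snippet above.
        string readText = "Opened 5/15/2014.\r\nClosed 5/16/2014.";
        foreach (string date in FindDates(readText))
        {
            Console.WriteLine(date);
        }
    }
}
```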

Edit: If you're going to use Interop because it is okay for user-run activities (as pointed out in comments), you can save the document as a text file then do the regex:

private void button1_Click(object sender, EventArgs e)
{
    string docPath = @"C:\Users\<computer name>\Documents\TestItemHelpers";
    string testFile = "TestWordDoc.docx";

    Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
    Document document = application.Documents.Open(Path.Combine(docPath, testFile));
    //Save as plain text; the remaining optional COM arguments are left out (needs C# 4 or later)
    document.SaveAs(Path.Combine(docPath, "TestTxt.txt"), WdSaveFormat.wdFormatText);
    ((_Document)document).Close(false); //cast as _Document, same ambiguity as with Quit
    ((_Application)application).Quit();

    string readText = File.ReadAllText(Path.Combine(docPath, "TestTxt.txt"));

    //do regex here
}
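
If you stay with Interop, you may not need the temporary .txt file at all: the whole body text can be read in one call through document.Content.Text instead of looping over document.Words. The following is only a sketch under a few assumptions: Word is installed on the machine, the project targets C# 4 or later (so the optional ref arguments can be omitted), and the class and method names are made up for the example.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

static class WordTextSketch
{
    // Hypothetical helper: opens the file read-only and returns its plain text.
    public static string ReadPlainText(string path)
    {
        Word.Application application = new Word.Application();
        try
        {
            Word.Document document = application.Documents.Open(path, ReadOnly: true);
            // Content spans the main document body; Text carries no formatting.
            string text = document.Content.Text;
            // Close without saving; the cast avoids the method/event ambiguity.
            ((Word._Document)document).Close(false);
            // Word marks paragraph ends with a bare '\r'.
            return text.Replace("\r", Environment.NewLine);
        }
        finally
        {
            // Always shut down the hidden Word instance, even if Open fails.
            ((Word._Application)application).Quit();
        }
    }
}
```

The returned string can go straight into Regex.Matches. Text taken from tables or with manual line breaks may still contain a few control characters, so check the output on a representative document first.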
CarenRose
  • Your first link only applies to *server-side* processing. It is perfectly fine for user-run applications. – crashmstr May 15 '14 at 19:52
  • My program may get used for server side work so this might actually be perfect for me. – user3003304 May 15 '14 at 20:15
  • I added the Interop SaveAs just in case you were interested in looking at that way too. – user1914368 May 15 '14 at 20:35
  • So I looked at the Spire.Doc stuff, but the free version will only read up to 100 paragraphs, and that probably won't work for my purposes. – user3003304 May 16 '14 at 17:41
  • There are other paid libraries that will read more than 100 paragraphs, but you can use this for testing. If you are going to use this on a server then you will definitely want to use something other than MS Word Interop. – user1914368 May 16 '14 at 18:45
  • If you're just going to be working with docx, another way to manipulate the document is OpenXML. – user1914368 May 16 '14 at 18:53
  • Spire hasn't been updated in a while, is there anything more current? – gillonba Sep 04 '15 at 15:08
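
Following up on the OpenXML suggestion in the comments: a .docx file is a zip package whose body text lives in XML, so for simple documents the text can be pulled out with framework classes alone. This is a rough sketch, not the Open XML SDK. It assumes the project references WindowsBase.dll (for System.IO.Packaging), that the main part sits at the conventional /word/document.xml location, and that the class and method names are invented for the example. It ignores tabs, manual line breaks, headers and footers, and does not work for the older binary .doc format.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // assumption: project references WindowsBase.dll
using System.Text;
using System.Xml.Linq;

static class DocxTextSketch
{
    // WordprocessingML namespace used by the <w:p> and <w:t> elements.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: one output line per <w:p> paragraph in the main body.
    public static string ReadPlainText(string docxPath)
    {
        using (Package package = Package.Open(docxPath, FileMode.Open, FileAccess.Read))
        {
            // Assumption: the main document part is at the conventional location.
            Uri partUri = new Uri("/word/document.xml", UriKind.Relative);
            PackagePart part = package.GetPart(partUri);
            using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
            {
                XDocument xml = XDocument.Load(stream);
                StringBuilder builder = new StringBuilder();
                foreach (XElement paragraph in xml.Descendants(W + "p"))
                {
                    // Each <w:t> element holds a fragment of the paragraph's text.
                    foreach (XElement fragment in paragraph.Descendants(W + "t"))
                    {
                        builder.Append(fragment.Value);
                    }
                    builder.AppendLine();
                }
                return builder.ToString();
            }
        }
    }
}
```

Because nothing here starts Word, this route avoids the server-side Interop concern raised above, at the cost of handling only the .docx format.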