1

I want to read data, such as strings, from a .docx file in C# code. I looked through some existing questions but couldn't work out which approach to use.

I'm trying to use ApplicationClass Application = new ApplicationClass(); but I get the following error:

The type 'Microsoft.Office.Interop.Word.ApplicationClass' has no constructors defined

I also want to get the full text of my docx file, not separate words.

foreach (FileInfo f in docFiles)
{
    Application wo = new Application();
    object nullobj = Missing.Value;
    object file = f.FullName;
    Document doc = wo.Documents.Open(ref file, .... . . ref nullobj);
    doc.Activate();
    doc. == ??    
}

How can I get the whole text from a docx file?

Balagurunathan Marimuthu
Big.Child
  • I hope this helps: https://forums.asp.net/p/1688845/4463018.aspx/1?Re+Read+doc+or+docx+file+with+formatting – Glory Raj Jun 11 '12 at 07:35
  • [This constructor supports the .NET Framework infrastructure and is not intended to be used directly from your code.](http://msdn.microsoft.com/library/microsoft.office.interop.word.applicationclass.applicationclass(v=office.11)) – Joey Jun 11 '12 at 07:35
  • A `docx` file is actually a Zip archive. You can unzip it using `sharpziplib` and read `word\document.xml`, or get pictures from the media directory. – ebattulga Jun 11 '12 at 07:40
  • In fact it is more than a zip; it is an OPC package. If you want to start manually reading the files inside the archive, use System.IO.Packaging. – Anders Forsgren Jun 11 '12 at 07:45
  • I have a console application that I run like a service, so I can't use ASP solutions. – Big.Child Jun 11 '12 at 07:51
  • If you're running unattended on a server, you shouldn't use the Office Automation libraries. They do not play well with headless servers, may occasionally pop up dialogs, and so forth. For post-2007 formats, your best bet is the System.IO.Packaging namespace, as @AndersForsgren said. – Avner Shahar-Kashtan Jun 11 '12 at 08:22
  • I think ZipPackage is the most suitable solution for me. I'm trying to extract the text now but haven't managed it yet. Thanks for the advice. – Big.Child Jun 11 '12 at 08:37
  • And how can I get the whole text from my docx file? – Big.Child Jun 12 '12 at 06:42
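
The System.IO.Packaging suggestion from the comments could look roughly like the sketch below. It is only an illustration: the names `PackageTextReader` and `ReadAllText` are made up, and it assumes a transitional-format .docx whose main part is reachable through the standard officeDocument relationship. On .NET Framework the namespace lives in WindowsBase.dll; on newer runtimes it is assumed to come from the System.IO.Packaging NuGet package.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class PackageTextReader
{
    // Relationship type that links the package root to the main document part
    // (assumption: transitional OOXML, not the "strict" variant).
    private const string OfficeDocumentRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    public static string ReadAllText(string docxPath)
    {
        using (Package package = Package.Open(docxPath, FileMode.Open, FileAccess.Read))
        {
            foreach (PackageRelationship rel in package.GetRelationshipsByType(OfficeDocumentRelType))
            {
                // Resolve the relationship target against the package root.
                Uri partUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), rel.TargetUri);
                PackagePart part = package.GetPart(partUri);

                using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
                {
                    var xml = new XmlDocument();
                    xml.Load(stream);
                    // InnerText concatenates every text node of the part.
                    return xml.DocumentElement.InnerText;
                }
            }
        }
        throw new InvalidDataException("No main document part found in " + docxPath);
    }
}
```

Following the relationship instead of hard-coding `word/document.xml` is the main difference from treating the file as a plain zip; nothing is written to disk and no Office installation is needed.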

5 Answers

4

This is what I was looking for; it extracts the whole text from a docx file:

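// ZipFile.Read suggests the DotNetZip library (Ionic.Zip); System.IO and System.Xml are also needed.
using (ZipFile zip = ZipFile.Read(filename))
{
    MemoryStream stream = new MemoryStream();
    // Extract the main document part into memory.
    zip.Extract(@"word/document.xml", stream);
    stream.Seek(0, SeekOrigin.Begin);
    XmlDocument xmldoc = new XmlDocument();
    xmldoc.Load(stream);
    // InnerText concatenates every text node in the document.
    string PlainTextContent = xmldoc.DocumentElement.InnerText;
}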
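
One caveat with `InnerText`: it concatenates every text node, so paragraph boundaries are lost. A minimal sketch of keeping one line per paragraph is shown below. The names `WordXmlText` and `ToPlainText` are hypothetical, and the sketch deliberately ignores tabs, manual line breaks, and text boxes (text inside a text box may be emitted twice).

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class WordXmlText
{
    // Main WordprocessingML namespace (transitional format).
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the document text with one line per w:p paragraph.
    public static string ToPlainText(XmlDocument documentXml)
    {
        var ns = new XmlNamespaceManager(documentXml.NameTable);
        ns.AddNamespace("w", W);

        var sb = new StringBuilder();
        foreach (XmlNode paragraph in documentXml.SelectNodes("//w:body//w:p", ns))
        {
            // A paragraph's text is split across runs; join the w:t pieces.
            foreach (XmlNode text in paragraph.SelectNodes(".//w:t", ns))
                sb.Append(text.InnerText);
            sb.AppendLine();
        }
        return sb.ToString();
    }
}
```

With the code above, this would be called as `WordXmlText.ToPlainText(xmldoc)` in place of `xmldoc.DocumentElement.InnerText`.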
Big.Child
3

Try the Word.Application interface instead of ApplicationClass.

See: Understanding Office Primary Interop Assembly Classes and Interfaces
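
For the interop route, the missing piece in the question's loop is most likely `doc.Content.Text`, which returns the text of the document's main story as one string. The sketch below assumes C# 4 or later (so the `ref` and optional COM arguments can be omitted), a reference to Microsoft.Office.Interop.Word, and Word installed on the machine; `InteropTextReader` is a made-up name.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of any library.
public static class InteropTextReader
{
    public static string ReadAllText(string docPath)
    {
        // Instantiate via the interface, not ApplicationClass.
        Word.Application app = new Word.Application();
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(docPath, ReadOnly: true);
            // Content is a Range covering the main document story.
            return doc.Content.Text;
        }
        finally
        {
            // The casts pick the Close/Quit methods over the same-named events.
            if (doc != null) ((Word._Document)doc).Close(false);
            ((Word._Application)app).Quit(false);
        }
    }
}
```

Paragraphs come back separated by carriage returns. As noted in the comments on the question, this approach is a poor fit for unattended or server use.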

santosh singh
  • As I said, `ApplicationClass wordApp = new ApplicationClass();` gives me an error. – Big.Child Jun 11 '12 at 07:47
  • Because you can't do that. You should use Application instead, as geek told you, i.e. `Word.Application app = new Word.Application();` – Francesco Baruchelli Jun 11 '12 at 09:09
  • `Application wo = new Application(); object nullobj = Missing.Value; object file = f.FullName; Document doc = wo.Documents.Open(ref file, .... . . ref nullobj); doc.Activate(); doc.` => I want to know how I can get the whole text from the docx file. – Big.Child Jun 11 '12 at 13:56
  • check out these links http://stackoverflow.com/questions/1296743/how-can-i-query-a-word-docx-in-an-asp-net-app http://stackoverflow.com/questions/1492738/how-to-extract-plain-text-from-a-docx-file-using-the-new-ooxml-support-in-apache – santosh singh Jun 11 '12 at 16:40
  • In which library can I find XWPFWordExtractor? – Big.Child Jun 12 '12 at 06:34
  • But I think those links are not for C#. – Big.Child Jun 12 '12 at 06:42
0

The .docx format, like the other Microsoft Office formats whose extension ends in "x", is simply a ZIP package that you can open, modify, and recompress.

So use an Office Open XML library like this.
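
To see the ZIP structure for yourself, a small sketch like the following lists the parts inside a .docx using the built-in System.IO.Compression classes (.NET Framework 4.5 or later, with references to System.IO.Compression and System.IO.Compression.FileSystem). `DocxInspector` and `ListParts` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class DocxInspector
{
    // Returns the full name of every entry in the package, e.g. "word/document.xml".
    public static List<string> ListParts(string docxPath)
    {
        var names = new List<string>();
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
                names.Add(entry.FullName);
        }
        return names;
    }
}
```

For a typical Word document the list includes `[Content_Types].xml`, `_rels/.rels`, and `word/document.xml`, the last of which holds the body text.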

Matteo Migliore
0

Enjoy.

Make sure you are targeting .NET Framework 4.5 or later; the ZipFile class used below was introduced in that version.

using NUnit.Framework;

[TestFixture]
public class GetDocxInnerTextTestFixture
{
    private string _inputFilepath = @"../../TestFixtures/TestFiles/input.docx";

    [Test]
    public void GetDocxInnerText()
    {
        string documentText = DocxInnerTextReader.GetDocxInnerText(_inputFilepath);

        Assert.IsNotNull(documentText);
        Assert.IsTrue(documentText.Length > 0);
    }
}

using System.IO;
using System.IO.Compression;
using System.Xml;

public static class DocxInnerTextReader
{
    public static string GetDocxInnerText(string docxFilepath)
    {
        // Extract next to the source file; GetFullPath keeps this working for bare file names.
        string folder = Path.GetDirectoryName(Path.GetFullPath(docxFilepath));
        string extractionFolder = Path.Combine(folder, "extraction");

        // Remove leftovers from a previous run; stale files would make ExtractToDirectory throw.
        if (Directory.Exists(extractionFolder))
            Directory.Delete(extractionFolder, true);

        ZipFile.ExtractToDirectory(docxFilepath, extractionFolder);
        string xmlFilepath = Path.Combine(extractionFolder, "word", "document.xml");

        var xmldoc = new XmlDocument();
        xmldoc.Load(xmlFilepath);

        // InnerText concatenates every text node of the main document part.
        return xmldoc.DocumentElement.InnerText;
    }
}
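
If writing an extraction folder to disk is undesirable, the same result can probably be had by reading the entry straight from the archive. The sketch below is one way to do that with the same .NET 4.5 classes; `DocxStreamReader` is a hypothetical name chosen to avoid clashing with the class above.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class DocxStreamReader
{
    public static string GetDocxInnerText(string docxPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            // GetEntry returns null when the entry does not exist.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in " + docxPath);

            // Read the XML directly from the compressed stream; nothing touches the disk.
            using (Stream stream = entry.Open())
            {
                var xmldoc = new XmlDocument();
                xmldoc.Load(stream);
                return xmldoc.DocumentElement.InnerText;
            }
        }
    }
}
```

This avoids both the cleanup step and the risk of two concurrent calls fighting over the same extraction folder.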
sapbucket
0

First, add references to these assemblies:

System.Xml
System.IO.Compression.FileSystem

Second, make sure these using directives are at the top of your class file:

using System.IO;
using System.IO.Compression;
using System.Xml;

Then you can use the code below:

public string DocxToString(string docxPath)
{
    // Temporary extraction directory next to the source file
    string extractDir = Path.Combine(Path.GetDirectoryName(Path.GetFullPath(docxPath)), Path.GetFileName(docxPath) + ".tmp");
    // Delete any old extraction directory
    if (Directory.Exists(extractDir)) Directory.Delete(extractDir, true);
    // Extract all media and XML parts into the extraction directory
    ZipFile.ExtractToDirectory(docxPath, extractDir);

    XmlDocument xmldoc = new XmlDocument();
    // Load the main document part, which holds the document text
    xmldoc.Load(Path.Combine(extractDir, "word", "document.xml"));
    // Delete the extraction directory
    Directory.Delete(extractDir, true);
    // Return all text of the document from the XML
    return xmldoc.DocumentElement.InnerText;
}

Enjoy...

MiMFa