How to extract text from MS office documents in C#

Question

I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.

KyleM · Answer 1 · 2011-12-28T18:27:09.767

47

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code opens a document and returns its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.

See: http://msdn.microsoft.com/en-us/library/bb448854.aspx

    using System;
    using System.Text;
    using System.Xml;
    using DocumentFormat.OpenXml.Packaging;
    using Microsoft.SharePoint; // only needed for SPFile

    // SPFile is just an example source. WordprocessingDocument.Open also accepts
    // a file path or any stream that contains a .docx file.
    public static string TextFromWord(SPFile file)
    {
        const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        StringBuilder textBuilder = new StringBuilder();
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
        {
            // Manage namespaces to perform XPath queries.
            NameTable nt = new NameTable();
            XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
            nsManager.AddNamespace("w", wordmlNamespace);

            // Get the document part from the package.
            // Load the XML in the document part into an XmlDocument instance.
            XmlDocument xdoc = new XmlDocument(nt);
            xdoc.Load(wdDoc.MainDocumentPart.GetStream());

            // Each w:p is a paragraph; each w:t inside it is a run of text.
            XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
            foreach (XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                foreach (XmlNode textNode in textNodes)
                {
                    textBuilder.Append(textNode.InnerText);
                }
                // End every paragraph with a line break.
                textBuilder.Append(Environment.NewLine);
            }
        }
        return textBuilder.ToString();
    }

edited Dec 28 '11 at 18:27

answered Dec 28 '11 at 18:21

KyleM

5

@adrianbanks I feel that this answer is *currently* better than the accepted answer because the accepted answer will not work on certain versions of Windows and because IFilter is a deprecated interface. Of course at the time adrian's post was written that was not the case. – KyleM Dec 28 '11 at 18:24
5

What about SPFile? The argument you are passing to the function is of this type, and all I could find about it is the Microsoft.SharePoint namespace in Microsoft.SharePoint.dll, and this DLL is not easy to find. What have you referenced to get SPFile? – FrenkyB Sep 30 '13 at 10:33
1

@user867703 You don't have to use SPFile. It was an example. You can use any .docx file (opened as a binary stream). Look at the WordprocessingDocument.Open method, that's the important method. – KyleM Sep 30 '13 at 15:12
5

I simply changed SPFile to a path (string) and passed just the path to the Open method -> it works. The solution is very clear and simple. – FrenkyB Sep 30 '13 at 19:15
@KyleM This doesn't seem to work for me on a 64 bit system. I can't find the DocumentFormat.OpenXml DLL for a 64 bit system. Adding the 32 bit one doesn't work. Or am I doing something wrong? – Maxsteel Oct 18 '13 at 13:58
@Maxsteel Well your application will have to be run in 32 bit mode. A 64 bit process cannot load a 32 bit .dll. All assemblies loaded by a particular process must conform to the "bit-ness" of that process – KyleM Oct 18 '13 at 16:01
@KyleM Hey, thanks for the reply. Turns out I just had to change the framework from 2.0 to 3.5. And it does work in my 64 bit project, just to confirm. Thanks anyway :) – Maxsteel Oct 19 '13 at 09:24
@Maxsteel Glad you got it working but what I said was correct. See http://stackoverflow.com/questions/2265023/load-32bit-dll-library-in-64bit-application – KyleM Oct 20 '13 at 06:13
1

In the OpenXML package you need to import: `DocumentFormat.OpenXml.Packaging` `DocumentFormat.OpenXml.Wordprocessing` And you need to reference `WindowsBase.dll` for it to work. Other than that; nice solution. – Kristian Barrett Dec 09 '14 at 14:57
1

@KristianBarrett Thanks. If you reference the DLL I mentioned in the post, I think Visual Studio will tell you which packages to import. It's been a while though, so thanks for the exact imports for anyone who needs them. – KyleM Dec 10 '14 at 16:45

score 27 · Accepted Answer · answered Jun 18 '09 at 08:28

Using P/Invoke you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can just ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).

adrianbanks

Interesting... a very sneaky solution :) – Skurmedel Jun 18 '09 at 09:05
Not really. It's the mechanism used by the indexing service on Windows and I think the desktop search also uses it. I've used it to index pdfs (by installing the Adobe IFilter - http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611), all types of Office documents (the IFilters for these come installed with Windows) and several other file types. When it works, it works well. Occasionally though, you get no text back from the IFilter, and no reason as to why. – adrianbanks Jun 18 '09 at 11:03
2

I used P/Invoke and found it excellent. To extract text from any document, all we have to do is make sure the appropriate IFilter is installed on the machine (or download and install it). I also love this article and sample from CodeProject: http://www.codeproject.com/KB/cs/IFilter.aspx. For MS Office 2007, here is the MS Office 2007 filter pack: http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en – Elias Haileselassie Jun 19 '09 at 08:25
Yes, as long as you install the PDF iFilter. You can do this by installing Acrobat Reader (the iFilter gets installed with it), or by installing the iFilter separately (http://www.adobe.com/support/downloads/detail.jsp?ftpID=4025). [Note: other PDF iFilters are available :)] – adrianbanks Feb 22 '10 at 17:15
2 quick Qs - a) I am currently using the method outlined here - http://www.codeproject.com/KB/cs/PDFToText.aspx to extract text from PDF. In what way would using IFilters be any different? b) In the IFilter method you linked, the author does a: TextReader reader = new FilterReader(fileName); I am using the FileUpload control in ASP.NET and I cannot get the path to the fileName as this is not exposed on the server side for security. I can only do the following with the FileUpload control on the server side: Stream str = fileUpload1.FileContent; byte[] b = fileUpload1.FileBytes; – Nick Feb 22 '10 at 17:31
@user102533: a) The only real difference is that using the IFilter gives you a generic method of extracting the text from any supported file type. Using PDFToText is specific to that library, and to PDF files. If you only need to do it for PDF files though, it doesn't make much difference (and might be better as the Adobe IFilter is a bit temperamental). b) IFilters work by you passing them a filename. What I've done in the past is to save the byte[] to a temporary file and then pass its filename to the IFilter. – adrianbanks Feb 22 '10 at 21:59
Please post a sample of invoking an iFilter using pInvoke. – paparazzo Dec 27 '11 at 16:55
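
The temporary-file workaround mentioned in the comments above can be sketched roughly as follows. This is not an IFilter implementation: the class and method names are hypothetical, and `extractFromPath` is a placeholder for whatever path-based extractor you use (for example the FilterReader from the linked article). The sketch keeps the original file extension on the assumption that the extractor picks its handler by extension.

```csharp
using System;
using System.IO;

public static class TempFileExtraction
{
    // Hypothetical helper: writes the uploaded bytes to a temporary file,
    // hands the path to a path-based extractor, and always cleans up.
    public static string ExtractViaTempFile(
        byte[] content, string extension, Func<string, string> extractFromPath)
    {
        // Assumption: the extractor chooses its handler from the file
        // extension, so the temp file keeps the original one (e.g. ".docx").
        string tempPath = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);

        File.WriteAllBytes(tempPath, content);
        try
        {
            return extractFromPath(tempPath);
        }
        finally
        {
            // Remove the temp file even if extraction throws.
            File.Delete(tempPath);
        }
    }
}
```

With the FileUpload control from the comment above, a call might look like `TempFileExtraction.ExtractViaTempFile(fileUpload1.FileBytes, ".pdf", path => new FilterReader(path).ReadToEnd())`, assuming the FilterReader class from the linked sample is available.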

score 18 · Answer 3 · answered Nov 23 '15 at 02:05

Tika is very helpful and makes it easy to extract text from different kinds of documents, including Microsoft Office files.

You can use this project which is such a nice piece of art made by Kevin Miller http://kevm.github.io/tikaondotnet/

Just simply add this NuGet package https://www.nuget.org/packages/TikaOnDotNet/

and then, this one line of code will do the magic:

var text = new TikaOnDotNet.TextExtractor().Extract("fileName.docx  / pdf  / .... ").Text;

Dina

1

This is the package you need: https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/ – Russell Horwood Apr 26 '17 at 11:36
6

Worth noting here that this actually runs Apache Tika (java) through IKVM which is a .net runtime for java, so it's not a light-weight solution. (40MB of binaries, basically a whole java runtime) – caesay Feb 28 '18 at 23:49

Jordan · Answer 4 · 2014-07-14T11:54:58.740

Let me slightly correct the answer given by KyleM. I added handling for two extra nodes that influence the result: w:tab, which is emitted as a horizontal tab ("\t"), and w:br, which is emitted as a vertical tab ("\v"). Here is the code:

    public static string ReadAllTextFromDocx(FileInfo fileInfo)
    {
        StringBuilder stringBuilder;
        using(WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(fileInfo.FullName, false))
        {
            NameTable nameTable = new NameTable();
            XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nameTable);
            xmlNamespaceManager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            string wordprocessingDocumentText;
            using(StreamReader streamReader = new StreamReader(wordprocessingDocument.MainDocumentPart.GetStream()))
            {
                wordprocessingDocumentText = streamReader.ReadToEnd();
            }

            stringBuilder = new StringBuilder(wordprocessingDocumentText.Length);

            XmlDocument xmlDocument = new XmlDocument(nameTable);
            xmlDocument.LoadXml(wordprocessingDocumentText);

            XmlNodeList paragraphNodes = xmlDocument.SelectNodes("//w:p", xmlNamespaceManager);
            foreach(XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t | .//w:tab | .//w:br", xmlNamespaceManager);
                foreach(XmlNode textNode in textNodes)
                {
                    switch(textNode.Name)
                    {
                        case "w:t":
                            stringBuilder.Append(textNode.InnerText);
                            break;

                        case "w:tab":
                            stringBuilder.Append("\t");
                            break;

                        case "w:br":
                            stringBuilder.Append("\v");
                            break;
                    }
                }

                stringBuilder.Append(Environment.NewLine);
            }
        }

        return stringBuilder.ToString();
    }

How do you extract images if there is one inside the w:p? — Shuaib, May 11 '17 at 12:41
Note: You will need to add a reference to DocumentFormat.OpenXml and add this: using DocumentFormat.OpenXml.Packaging; — Jeff, Feb 13 '23 at 20:00

score 11 · Answer 5 · answered Oct 19 '16 at 02:57

Use the Microsoft Office Interop. It's free and slick. Here is how I pulled all the words from a doc.

    using Microsoft.Office.Interop.Word;

    //Create Doc
    string docPath = @"C:\docLocation.doc";
    Application app = new Application();
    Document doc = app.Documents.Open(docPath);

    //Get all words
    string allWords = doc.Content.Text;
    doc.Close();
    app.Quit();

Then do whatever you want with the words.

Chris

1

Ah, brilliant my friend. This should now be the accepted answer, the rest are outdated. – Hugo Nava Kopp Oct 27 '16 at 11:09
1

This is a very easy, but also very slow solution. Open XML is "thousands" of times faster. – buks Nov 04 '16 at 16:25
3

_It's free_ - doesn't it require you to have Word installed? – Matt Burland Jan 04 '19 at 16:36
2

@Chris: And apart from Matt Burland's catch-22, how do I run this on a Linux server? ;) – Stefan Steiger Apr 26 '19 at 16:43

score 6 · Answer 6 · answered Sep 15 '16 at 16:40

A bit late to the party, but nevertheless - nowadays you don't need to download anything - all is already installed with .NET: (just make sure to add references to System.IO.Compression and System.IO.Compression.FileSystem)

using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml;
using System.Text;
using System.IO.Compression;

public static class DocxTextExtractor
{
    public static string Extract(string filename)
    {
        XmlNamespaceManager NsMgr = new XmlNamespaceManager(new NameTable());
        NsMgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

        using (var archive = ZipFile.OpenRead(filename))
        {
            return XDocument
                .Load(archive.GetEntry(@"word/document.xml").Open())
                .XPathSelectElements("//w:p", NsMgr)
                .Aggregate(new StringBuilder(), (sb, p) => p
                    .XPathSelectElements(".//w:t|.//w:tab|.//w:br", NsMgr)
                    .Select(e => { switch (e.Name.LocalName) { case "br": return "\v"; case "tab": return "\t"; } return e.Value; })
                    .Aggregate(sb, (sb1, v) => sb1.Append(v)))
                .ToString();
        }
    }
}

This looks like a great solution, but I'm unable to make this work since I'm getting an error: `Number of entries expected in End Of Central Directory does not correspond to number of entries in Central Directory.` — Hugo Nava Kopp, Mar 24 '17 at 16:26
That message seems to be a `ZipFile` notion of a zip file (i.e. docx file in this case) being corrupt... — lxa, Mar 26 '17 at 19:18

score 2 · Answer 7 · answered Jun 18 '09 at 07:38

Simple!

These two steps will get you there:

1) Use the Office Interop library to convert DOC to DOCX
2) Use DOCX2TXT to extract the text from the new DOCX

The link for 1) has a very good explanation of how to do the conversion and even a code sample.

An alternative to 2) is to just unzip the DOCX file in C# and scan for the files you need. You can read about the structure of the ZIP file here.
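
A minimal sketch of the "unzip and scan" alternative above, using only System.IO.Compression. The class and method names are made up for illustration. In a typical .docx package the main body text lives in word/document.xml, so that is usually the entry to look for in the returned list.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class DocxInspector
{
    // Hypothetical helper: returns the names of all XML parts inside a
    // .docx package, which is an ordinary ZIP archive.
    public static List<string> ListXmlParts(Stream docxStream)
    {
        // leaveOpen: true so the caller stays in control of the stream.
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Select(entry => entry.FullName)
                .Where(name => name.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .ToList();
        }
    }
}
```

For a file on disk, something like `using (var fs = File.OpenRead(path)) { var parts = DocxInspector.ListXmlParts(fs); }` should be enough to see which parts a given document contains.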

Edit: Ah yes, I forgot to point out as Skurmedel did below that you must have Office installed on the system on which you want to do the conversion.

joshcomley

3

Only sad part with the Office interop library is that you need to have Office installed. – Skurmedel Jun 18 '09 at 07:40
1

`Interop` is usable, but should be avoided if possible. – Tun Oct 21 '11 at 03:16
Microsoft Word 12.0 Object Library --> This is not in my Add Reference list on the Add Reference right click. Is there another way that Microsoft Word 12.0 Object Library has to be entered so that I can read in a word document. – Doug Hauf Dec 19 '13 at 21:50
Interop does not work on GoDaddy hosting. GoDaddy does not support Office. – Hardik Mandankaa Jun 17 '16 at 04:57

Skurmedel · Answer 8 · 2009-06-18T10:24:30.550

1

I wrote a docx text extractor once, and it was very simple. Basically docx, and the other (new) formats I presume, is a zip file containing a bunch of XML files. The text can be extracted with an XmlReader, using only .NET classes.

I don't have the code anymore, it seems :(, but I found a guy who has a similar solution.

Maybe this isn't viable for you if you need to read .doc and .xls files though, since they are binary formats and probably much harder to parse.

There is also the OpenXML SDK, still in CTP though, released by Microsoft.
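
Since the original code is lost, here is a rough reconstruction of the XmlReader idea described above, using only .NET classes. It is a sketch, not the author's code: the class name is invented, it only reads word/document.xml (so headers, footers and footnotes are ignored), and it treats every w:t element as text and every w:p element as a paragraph that ends with a line break.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public static class DocxXmlReaderExtractor
{
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: streams through word/document.xml and collects
    // the text of w:t elements, one line per w:p paragraph.
    public static string Extract(Stream docxStream)
    {
        var sb = new StringBuilder();
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream part = entry.Open())
            using (XmlReader reader = XmlReader.Create(part))
            {
                bool inText = false; // true while inside a w:t element
                while (reader.Read())
                {
                    bool isW = reader.NamespaceURI == W;
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (isW && reader.LocalName == "t" && !reader.IsEmptyElement)
                                inText = true;
                            else if (isW && reader.LocalName == "p" && reader.IsEmptyElement)
                                sb.Append(Environment.NewLine); // empty paragraph
                            break;
                        case XmlNodeType.Text:
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            if (inText) sb.Append(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            if (isW && reader.LocalName == "t") inText = false;
                            else if (isW && reader.LocalName == "p") sb.Append(Environment.NewLine);
                            break;
                    }
                }
            }
        }
        return sb.ToString();
    }
}
```

The state flag is used instead of ReadElementContentAsString because that method advances the reader past the element, which makes it easy to skip the next node in a `while (reader.Read())` loop.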

edited Jun 18 '09 at 10:24

answered Jun 18 '09 at 07:25

Skurmedel

This is really great! I am done with docx, but what about the rest? – Elias Haileselassie Jun 18 '09 at 09:22
You can "connect" to an xlsx file as if it were a database using ODBC, I think. It is quite a cumbersome solution. I have no idea how to read .doc or .xls files, so I can't help you there. Here is a reference for .xls files though: http://sc.openoffice.org/excelfileformat.pdf – Skurmedel Jun 18 '09 at 10:32
I couldn't find anything better on XLSX than the specification itself sadly: http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%201st%20edition%20Part%201%20(PDF).zip – Skurmedel Jun 18 '09 at 10:37
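
For .xlsx, the same zip-plus-XML idea can be applied as a rough sketch. It assumes the workbook keeps its cell strings in the conventional xl/sharedStrings.xml part; numbers, dates, inline strings and formula results live in the worksheet parts and are not covered. The class and method names are illustrative only, and the strings come back in string-table order, not cell order.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class XlsxSharedStrings
{
    private static readonly XNamespace S =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Hypothetical helper: returns the entries of the shared string table.
    public static List<string> Read(Stream xlsxStream)
    {
        using (var archive = new ZipArchive(xlsxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the shared string table is at its usual location.
            ZipArchiveEntry entry = archive.GetEntry("xl/sharedStrings.xml");
            if (entry == null)
                return new List<string>(); // workbook without shared strings

            using (Stream part = entry.Open())
            {
                return XDocument.Load(part).Root
                    .Elements(S + "si")
                    // Plain strings sit in si/t, rich text in si/r/t.
                    .Select(si => string.Concat(
                        si.Elements(S + "t")
                          .Concat(si.Elements(S + "r").Elements(S + "t"))
                          .Select(t => t.Value)))
                    .ToList();
            }
        }
    }
}
```

This is usually enough for search or indexing. If cell order matters, the worksheet parts have to be read as well and their string indexes resolved against this table.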

score 0 · Answer 9 · answered Jun 23 '17 at 16:51

If you're looking for asp.net options, the interop won't work unless you install office on the server. Even then, Microsoft says not to do it.

I used Spire.Doc, and it worked beautifully. It even read documents that were really .txt but had been saved as .doc. They have free and paid versions. You can also get a trial license that removes some warnings from documents that you create, but I didn't create any, just searched them, so the free version worked like a charm.

Erik Felde, could you give an example of using Spire.Doc in ASP.NET? — Maksud, Sep 25 '18 at 03:52

score 0 · Answer 10 · answered Oct 09 '19 at 10:18

One of the suitable options for extracting text from Office documents in C# is GroupDocs.Parser for .NET API. The following are the code samples for extracting simple as well as formatted text.

Extracting Text

// Create an instance of Parser class
using(Parser parser = new Parser("sample.docx"))
{
    // Extract a text into the reader
    using(TextReader reader = parser.GetText())
    {
        // Print a text from the document
        // If text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}

Extracting Formatted Text

// Create an instance of Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract a formatted text into the reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print a formatted text from the document
        // If formatted text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Formatted text extraction isn't supported" : reader.ReadToEnd());
    }
}

Disclosure: I work as Developer Evangelist at GroupDocs.
