1

I'm looking for a way to search inside Word documents and list the documents that match the search criteria. I'll describe the scenario in more detail here.

On a Windows system I have a bunch of folders. Each folder has a lot of Word documents. I need an application that can search inside a specific folder for keywords that might occur in those Word documents. Something like the FULLTEXT search that MySQL has.

So if I search for the keywords "microsoft" and "windows XP", I want it to list every Word document that contains one or more of those keywords.

Of course, the more often those keywords appear in a document, the higher it should rank in the resulting list.

Now my question is: is there a tool out there that does exactly this? Or am I better off writing such a tool myself in C#/.NET? If so, which APIs should I look at?

PS. They are .doc and .docx files.

Deduplicator
Vivendi

4 Answers

2

Looks like you need a full-blown search engine to me, including parsing, indexing, ranking, search, etc. Probably not very pleasant to implement it yourself... You could have a look at Apache Lucene.
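To give a feel for what "indexing, ranking, search" involves, here is a minimal sketch in Python of an inverted index with occurrence-count ranking. It is not Lucene and is nowhere near as capable. The names `tokenize`, `build_index` and `search` are made up for this illustration, and it assumes the document text has already been extracted into plain strings. Note that it treats "windows XP" as two separate words rather than a phrase.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a Counter of {document name: occurrences}."""
    index = defaultdict(Counter)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term][name] += 1
    return index


def search(index, query):
    """Return (name, score) pairs for documents matching ANY query term.

    The score is the total number of occurrences of all query terms,
    so documents mentioning the keywords more often come first.
    """
    scores = Counter()
    for term in set(tokenize(query)):
        scores.update(index.get(term, {}))
    # Highest score first; ties broken by document name for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data standing in for extracted document text.
docs = {
    "a.docx": "Microsoft Windows XP setup notes. Windows XP drivers.",
    "b.docx": "Notes on Linux only.",
    "c.docx": "Microsoft Office tips.",
}
index = build_index(docs)
results = search(index, "microsoft, windows XP")
print(results)
```

A real engine adds a lot on top of this (stemming, phrase queries, better scoring, persistence, incremental updates), which is why reaching for an existing library is usually less painful.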

Tudor
0

There is a tool right under your nose. It's Windows Search and it has an API which should meet your needs perfectly.

You might have to install the filter packs to provide Office-specific indexing if you don't have Office installed.

Tim Rogers
0

Indexing is available within Windows and can deal with Word documents.

If you want to build your own index, you can use IFilters to extract text from documents. See: How to extract text from MS office documents in C#
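As a rough illustration of the text-extraction step, here is a sketch in Python that does not use IFilters at all. It relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The function name `docx_text` is made up for this example. It only covers .docx; the older binary .doc format needs a different extractor (an IFilter or a third-party library), and headers, footers and footnotes are ignored here.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the main body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string can then be fed into whatever index you build, for example `docx_text(r"C:\docs\report.docx")`, where the path is just a placeholder.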

Community
Guillaume
  • Indexing is often turned off by Windows users, so I wouldn't rely on it. IFilters are also not that reliable, so people end up resorting to something like Lucene. – Ivan G. Aug 21 '12 at 12:42
  • Building a feature that is already part of the OS is rarely a good idea. Does Lucene extract text from doc and docx files? – Guillaume Aug 21 '12 at 12:50
  • No, you'll have to use an external library to extract text, however it's better than fighting IFilter strangeness. We used Index Server for years and it's dumb. The biggest problems are deployment (installing PDF IFilters and others, the fact that users can switch search off for performance) and poor search quality. It's a really old technology we try to avoid at all costs now. – Ivan G. Aug 21 '12 at 12:56
  • @aloneguid So just a question about Lucene. You say it **can't** extract text from doc and docx files. But it **can** read, search (and rank) through doc files, right? I don't really need to extract text from the doc files. All I need to do is search through them for certain keywords and rank them. – Vivendi Aug 21 '12 at 13:10
0

You could try the SmartFinder app, available on the Microsoft Store.

It's developed in Java with the Apache Lucene library.

You can search the text and immediately get an extract of each document with the searched words highlighted in the results. You can refine your search with metadata (authors, keywords, publisher, ...) and you can also search with wildcards (for example with the * or ? special characters).

This is the Microsoft Store link to download the app: https://www.microsoft.com/store/apps/9PD0BCV3WKD1


Simona R.