I'm trying to come up with a simple solution to a problem I have because all of those I have found so far just seem too complicated!

The situation is that we use a proprietary application for managing most aspects of our business. It has an SQL Server 2005 backend database, which is quite large. The application also allows the attaching of Word and PDF documents to records, which we use extensively, and these are stored in the file system on the server, with the filenames referenced in the database. Unfortunately the search facilities in the application are poor, so I'm trying to build my own version.

So far I've got a neat ASP.NET page with a search box which will allow users to enter words to search for, as well as filter their results on other fields, such as department, date, etc. The Stored Procedure I've written in the database looks for the words they're searching for in several different fields in the database. What I'm really aiming for is Google-style 'one search to rule them all' effect, where the user doesn't have to specify where they expect to find the word they're looking for, they will just get hits anywhere that it appears in the database. And this is working.

What I want to add now is the ability for the search to include the text of the documents which are 'attached' to records. They are all either .doc or .pdf files but if I couldn't search the .pdf files it wouldn't be the end of the world.

In my ideal world what I'd do is find some software which would index the folder containing the documents (currently there are around 100,000 of them, averaging about 100k) and populate a table in my existing database with this index so that I could then just include that table in my search. I'd love it to just contain a record for each unique word it indexed and a join table referencing documents in the file system containing that word.
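In other words, something like this (table and column names are just my invention to illustrate the idea):

```sql
-- Hypothetical inverted-index schema: one row per unique word,
-- plus a join table linking words to the files that contain them.
CREATE TABLE IndexedWord (
    WordId INT IDENTITY PRIMARY KEY,
    Word NVARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE WordDocument (
    WordId INT NOT NULL REFERENCES IndexedWord(WordId),
    FileName NVARCHAR(260) NOT NULL,  -- path of the document in the file system
    PRIMARY KEY (WordId, FileName)
);
```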

Given that this seems fanciful and there isn't any software that will do this, or anything close to it, as far as I can see, what solution would you recommend? The server already has dtSearch running on it, indexing the very files I'm interested in. However, whilst I could wade through the documentation trying to figure out how to implement a search of this index through my own webpage (which I've started to do, and found heavy going), that would have to be a separate search to the one of the SQL database. I couldn't return results from the file index and the database in a unified way.

So, starting from the ultimate wish of having the indexed words stored in the database, with a view to implementing full-text searching on that, what would anyone suggest?

1 Answer

SQL Server has full-text search (http://msdn.microsoft.com/en-us/library/ms142571.aspx); this supports both PDF and Word files (though with some wrinkles - installation can be a bit tricky). The link is to the SQL Server 2008 documentation, but the feature has been present since SQL Server 2000.

So, super simplistically: your solution would require you to load the documents into SQL Server, and amend your stored procedure to query them using the built-in full-text search features.
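A minimal sketch of that approach (the table, column, and index names here are my own invention, and the exact full-text setup varies a little between SQL Server versions):

```sql
-- Store the document content alongside a pointer to the original file.
CREATE TABLE AttachedDocument (
    DocumentId INT IDENTITY PRIMARY KEY,
    FileName NVARCHAR(260) NOT NULL,
    FileExtension NVARCHAR(10) NOT NULL,  -- '.doc' or '.pdf'; tells the engine which filter to use
    Content VARBINARY(MAX) NOT NULL
);

-- Full-text index over the binary content; the KEY INDEX must name the
-- unique index on DocumentId, whose auto-generated name will differ.
CREATE FULLTEXT CATALOG DocumentCatalog;
CREATE FULLTEXT INDEX ON AttachedDocument (Content TYPE COLUMN FileExtension)
    KEY INDEX PK_AttachedDocument ON DocumentCatalog;

-- Then, inside the stored procedure, the attachments can be searched
-- with the same parameter used against the other database fields:
SELECT DocumentId, FileName
FROM AttachedDocument
WHERE CONTAINS(Content, @SearchTerm);
```

Because this table lives in the same database, its results can be UNIONed or joined with the existing field searches, giving the unified result set you're after.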

Keeping the file system and database versions of the document synchronized could be a challenge, but other than that, I think the solution should be fairly straightforward.

Neville Kuyt
  • Interesting. I did think of moving the documents into the database, but not of keeping them there AND in the file system. I've tried to get some idea of the size of record a Word or PDF document would create, without success. I know that Word documents in particular tend to be extremely bloated, so I'm hoping they'd create pretty small records. I also Googled around for a way of bulk-importing documents into an SQL database and having an automatic import routine, but didn't find anything on that either. Any suggestions for those things? – Phil Stratford Aug 12 '11 at 08:14
  • 1
    When you import the document into SQL Server, it consumes the same amount of space as on the file system (I assume); the full text index, however, will indeed be very efficient. It would be more elegant to store the documents in one place - but that sounds like major surgery. If you can, then you should! I'm not aware of a bulk import script for the documents, but knocking up a custom import app in Powershell or similar shouldn't be too big of a task. Alternatively, you could look at the IFilter specification - this is what search engines use to translate documents into plain text. – Neville Kuyt Aug 12 '11 at 08:53
  • Thanks for your help, I'll try going down that route and let you know how I get on. – Phil Stratford Aug 15 '11 at 09:20
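The custom import app mentioned in the comments could be sketched in a few lines; here is a hedged example in Python rather than PowerShell (the `AttachedDocument` table and its columns are illustrative, and the `execute` callback stands in for a real database cursor):

```python
import os

# Only these attachment types should be loaded into the database.
EXTENSIONS = {".doc", ".pdf"}

def find_documents(root):
    """Yield (relative path, file bytes) for every .doc/.pdf under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in EXTENSIONS:
                full = os.path.join(dirpath, name)
                with open(full, "rb") as f:
                    yield os.path.relpath(full, root), f.read()

def import_documents(root, execute):
    """Insert every document under root; returns the number imported.

    'execute(sql, params)' would be cursor.execute from a pyodbc
    connection in real use; the INSERT targets a hypothetical
    AttachedDocument table like the one in the answer above.
    """
    sql = ("INSERT INTO AttachedDocument (FileName, FileExtension, Content) "
           "VALUES (?, ?, ?)")
    count = 0
    for rel_path, content in find_documents(root):
        ext = os.path.splitext(rel_path)[1].lower()
        execute(sql, (rel_path, ext, content))
        count += 1
    return count
```

In real use you would pass in `cursor.execute` from a `pyodbc` connection and commit in batches; with around 100,000 files, committing every few hundred inserts keeps the transaction log manageable.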