3

I develop a web application that let users to upload files like images and documents. this file divided into two parts :

  1. binary files
  2. document files

I want to allow users to search documents that uploaded. specialy using full text search. What data types I should use for these two file types?

mshirdel
  • 88
  • 1
  • 13

3 Answers3

2

You can store the data in binary and use full text search to interpret the binary data and extract the textual information: .doc, .txt, .xls, .ppt, .htm. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full text search has to be enabled.Not sure how adding a full text index will affect your system - i.e., its size. You'll also need to look at the execution plan to ensure the index gets used at query time.

For more information look at this:

http://technet.microsoft.com/en-us/library/ms142499(SQL.90).aspx

Pros: The main advantage of storing data in the database is that it makes the data "self-contained". Since all of the data is contained within the database, backing up the data, moving the data from one database server to another, replicating the database, and so on, is much easier.

also you can enable versioning of files and also make it easier for load balanced web farms.

Cons: you can read it here: https://dba.stackexchange.com/questions/3924/sql-server-2005-large-binary-storage. But this is something that you have to do in order to search through the files efficiently.

Or the other thing that I could suggest is probably storing keywords in the database and then linking the same to file in the fileshare.

Here is an article discussing abt using a FileStream and a database: http://blogs.msdn.com/b/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx

Community
  • 1
  • 1
Baz1nga
  • 15,485
  • 3
  • 35
  • 61
  • We can extract textual information from pdf files in varbinary fields: http://weblogs.asp.net/wallym/archive/2005/02/28/382060.aspx. But this link describe this in SQL Server 2005 – mshirdel Jan 08 '12 at 07:23
  • the same should be possible in SQL server 2008.. I am sure they wouldnt have stopped supporting the same – Baz1nga Jan 08 '12 at 07:29
  • How effective is this method? If the files may be large. – mshirdel Jan 08 '12 at 07:43
  • I create full-text search index on varbinary(max) column and insert file content into this column. I create an extension column for specify content type.but when I query this table with contains, no result were showing.why? – mshirdel Jan 09 '12 at 14:07
0

You first need to convert the PDF to text. There are libraries for this sort of thing (ie: PowerGREP). Then I'd recommend storing the text of the PDF files in a database. If you need to do full text searching and logic such as "on the same line" then you'll need to store one record per line of text. If you just want to search for text in a file, then you can change the structure of your SQL schema to match your needs.

For docx files, I would convert them to RTF and search them that way while stored in SQL.

For images, Microsoft has a program called Microsoft OneNote that does OCR (optical character recognition) so you can search for text within images. It doesn't matter what tool you use, just that it supports OCR.

Essentially, if you don't have a way to directly read the binary file, then you need to convert it to text with some library, then worry about doing your searching.

JustBeingHelpful
  • 18,332
  • 38
  • 160
  • 245
  • Thank you for your guidance.Do you mean that I should convert docx files to rtf files before uploading files or my application do this? And what is your recommend for other file types like : xls, txt, and etc. – mshirdel Jan 08 '12 at 05:40
  • There might be a way to keep them in the docx format, but I don't know of a way. RTF just contains tags and text, so if you want to search text, you could just parse though it and strip all the tags out. I'm not sure of the best method of XLS. Never done that before. – JustBeingHelpful Jan 08 '12 at 20:23
0

The full-text index can be created for columns which use any of the following data types – CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT, VARBINARY, VARBINARY (MAX), IMAGE and XML.

In addition, To use full text search you must create a full-text index for the table against which they want to run full-text search queries. For a particular SQL Server Table or Indexed View you can create a maximum of one Full-Text Index.

these are two article about it:

SQL SERVER - 2008 - Creating Full Text Catalog and Full Text Search

Using Full Text Search in SQL Server 2008

Pars.Engineer
  • 57
  • 2
  • 8