6

I am trying to implement a feature similar to the "Related Questions" on Stackoverflow.

How do I go about writing the SQL statement that will search the Title and Summary field of my database for similar questions?

If my questions is: "What is the SQL used to do a search similar to "Related Questions" on Stackoverflow".

Steps that I can think of are;

  1. Strip the quotation marks
  2. Split the sentence into an array of words and run a SQL search on each word.

If I do it this way, I am guessing that I wouldn't get any meaningful results. I am not sure if Full Text Search is enabled on the server, so I am not using that. Will there be an advantage of using Full Text Search?

I found a similar question but there was no answer: similar question

Using SQL 2005

Community
  • 1
  • 1
Picflight
  • 3,832
  • 13
  • 61
  • 90

8 Answers8

7

Check out this podcast.

One of our major performance optimizations for the “related questions” query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster.

Community
  • 1
  • 1
Nick Dandoulakis
  • 42,588
  • 16
  • 104
  • 136
  • 1
    You don't want to remove too many 'common' words - 'net, class, switch, case, data' are all common. First pass would probably to search for keyword sin your question and lookup questions with similair keywords – Martin Beckett Jun 01 '09 at 22:46
  • This just supports my (inexplicably downvoted) answer: it's not about the SQL, it's about what's done before and after the SQL. – Instance Hunter Jun 02 '09 at 16:55
1

After enabling Full Text search on my SQL 2005 server, I am using the following stored procedure to search for text.

ALTER PROCEDURE [dbo].[GetSimilarIssues] 
(
 @InputSearch varchar(255)
)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;

DECLARE @SearchText varchar(500);

SELECT @SearchText = '"' + @InputSearch + '*"'

SELECT  PostId, Summary, [Description], 
Created
FROM Issue

WHERE FREETEXT (Summary, @SearchText);
END
Picflight
  • 3,832
  • 13
  • 61
  • 90
1

They probably relate based on tags that are added to the questions...

Ropstah
  • 17,538
  • 24
  • 120
  • 194
  • 4
    The related questions dialog pops up when you enter a title, not tags. – Dan Lew Jun 01 '09 at 22:14
  • True, they search the title for 'unique' words. They probably filter out words like: "how","the","what" etc. and search the rest of the words in other titles while they don't take word positions into account. But they also probably search those unique words in tags, those matches are probably higher rated in search results. – Ropstah Jun 01 '09 at 22:40
0

I'm pretty sure it would be most efficient to implement the feature based on the tags associated with each post.

Christopher Garcia
  • 2,536
  • 7
  • 30
  • 40
0

It's probably done using a full text search which matches like words/phrases. I've used it in MySQL and SQL Server with decent success with out of the box functionality.

You can find more on MySQL full text searches at:

http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html

Or just google Full Text search and you will find a lot of information.

Ryan Smith
  • 8,344
  • 22
  • 76
  • 103
0

It looks keyword based on the title you enter, queried against titles and content of other questions. Probably easier (and more appropriate) to do in Lucene (or similar) then in a relational database.

Gandalf
  • 9,648
  • 8
  • 53
  • 88
0

I'd say it's probably a full text search on the question title and the question content and answers as well using the individual words (not the whole title) you enter. Then, using the ranking features of full-text, the top 10 or so questions that rank the highest are displayed.

As tydok pointed out, it looks like they are using full-text searching (I couldn't imagine any other way).

Here's the MSDN reference on Full-Text Searching, nailing the specific query used probably isn't going to happen.

scottm
  • 27,829
  • 22
  • 107
  • 159
-4

The SQL very well may be just "SELECT * FROM questions;". I find it hard to imagine that the algorithm for finding similar questions is implemented in SQL.

Instance Hunter
  • 7,837
  • 5
  • 44
  • 56