What algorithm does StackOverflow use for finding similar questions?

Question

I need to create a help desk for customers on a website I'm building, and I love the way StackOverflow finds similar questions. Does anyone know what algorithm the site uses, and can you point me to any references for it?

related question with answer http://stackoverflow.com/questions/891772/stackoverflow-related-questions-algorithm — Tyler, Apr 24 '13 at 15:56

score 6 · Accepted Answer · answered Apr 24 '13 at 17:53

There is a whole branch of machine learning called clustering (a type of unsupervised learning) that deals with this type of problem.

The question becomes part of a cluster, and other questions in the same cluster (probably ordered by a similarity or distance measure) are displayed as similar questions.
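As a rough illustration of that idea, here is a minimal sketch in Python. It is not what StackOverflow actually runs. It assumes a very simple single-pass "leader" clustering over title words with Jaccard similarity, and the function names, the sample questions, and the 0.15 threshold are all made up for the example.

```python
import re

def tokens(text):
    """Lowercase word set for a question title; a stand-in for real features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Similarity in [0, 1]: shared words divided by all distinct words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_cluster(questions, threshold=0.15):
    """Single pass: a question joins the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of question indices; item 0 is the leader
    for i, q in enumerate(questions):
        for cluster in clusters:
            if jaccard(tokens(q), tokens(questions[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no leader was close enough, start a new cluster
    return clusters

def similar_in_cluster(questions, clusters, index):
    """Other questions in the same cluster, most similar first."""
    cluster = next(c for c in clusters if index in c)
    target = tokens(questions[index])
    others = [i for i in cluster if i != index]
    others.sort(key=lambda i: jaccard(target, tokens(questions[i])), reverse=True)
    return [questions[i] for i in others]

# Hypothetical help desk questions.
questions = [
    "How do I reset my account password",
    "Reset password email never arrives",
    "How to change my account password",
    "Invoice shows the wrong billing address",
    "Update billing address on invoice",
]
clusters = leader_cluster(questions)
print(similar_in_cluster(questions, clusters, 0))
```

A real system would more likely use a library clustering algorithm over richer features, but the shape is the same: assign the question to a cluster, then sort its neighbours by similarity.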

There are various features that it can use for clustering, some of which may be:

Tags
Words in heading
Words in the text (lesser weight than heading)
Links to other questions/webpages.

and so on.
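One way to combine features like these is to put them all into a single weighted vector and compare vectors with cosine similarity. The sketch below assumes that approach. The weights, the field names (`tags`, `title`, `body`), and the sample questions are hypothetical, and links are left out to keep it short.

```python
import math
import re
from collections import Counter

# Hypothetical field weights: tags count most, body words least.
WEIGHTS = {"tags": 3.0, "title": 2.0, "body": 1.0}

def words(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def features(question):
    """Build one weighted bag-of-features vector from all fields."""
    vec = Counter()
    for tag in question["tags"]:
        vec["tag:" + tag] += WEIGHTS["tags"]
    for w in words(question["title"]):
        vec["word:" + w] += WEIGHTS["title"]
    for w in words(question["body"]):
        vec["word:" + w] += WEIGHTS["body"]
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical questions.
new = {"tags": ["login"], "title": "Cannot log in",
       "body": "The login page rejects my password"}
old_a = {"tags": ["login"], "title": "Login fails",
         "body": "My password is rejected at login"}
old_b = {"tags": ["billing"], "title": "Refund request",
         "body": "I was charged twice"}

print(cosine(features(new), features(old_a)))  # shares a tag and several words
print(cosine(features(new), features(old_b)))  # shares nothing
```

Tuning the weights, or replacing raw counts with something like TF-IDF, is where the feature engineering effort usually goes.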

There may also be derived features, built with techniques like text summarization or sentiment analysis, that are used for these kinds of problems. Which features work well depends on the problem.

Other areas where you see these algorithms in action are:

YouTube
Wikipedia
IMDb

and the list goes on.

So what can you do about your problem?

There is no single answer. It all depends on your data and your target query. Still, you can:

Learn feature engineering aspects of machine learning.
Learn about clustering.

(There are many online courses for these.)

Or

Hire a person who knows this stuff.

score 1 · Answer 2 · answered Apr 24 '13 at 15:54

Most likely a weighted match on tags, and perhaps a match() or equivalent weighted full-text search on the title.
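A guess like this can be sketched in a few lines of Python. The scoring below is only an assumption about what such a match might look like: shared tags and shared title terms are counted and combined with made-up weights. The function names and sample data are hypothetical, and in a database the title part would be a full-text query rather than a set intersection.

```python
import re

def title_terms(title):
    """Lowercase word set of a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def match_score(query, candidate, tag_weight=2.0, title_weight=1.0):
    """Weighted count of shared tags plus shared title terms."""
    shared_tags = set(query["tags"]) & set(candidate["tags"])
    shared_terms = title_terms(query["title"]) & title_terms(candidate["title"])
    return tag_weight * len(shared_tags) + title_weight * len(shared_terms)

def related(query, candidates, limit=5):
    """Titles of the best-scoring candidates, dropping those with no overlap."""
    scored = [(match_score(query, c), c["title"]) for c in candidates]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)  # stable, so ties keep input order
    return [title for _, title in scored[:limit]]

# Hypothetical data.
query = {"title": "Export report to PDF", "tags": ["reports", "pdf"]}
candidates = [
    {"title": "PDF export is blank", "tags": ["pdf", "export"]},
    {"title": "Schedule a weekly report", "tags": ["reports"]},
    {"title": "Change my avatar", "tags": ["profile"]},
]
print(related(query, candidates))
```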

The details are probably on Meta somewhere, or in the FAQ.

Dave
