7

I need to create a help desk for customers in a website I'm building and I love the way StackOverflow finds similar questions. Does anyone know what algorithm the site uses and can you provide any references where I can find one?

Qantas 94 Heavy
  • 15,750
  • 31
  • 68
  • 83
DreamWave
  • 1,934
  • 3
  • 28
  • 59
  • 2
    related question with answer http://stackoverflow.com/questions/891772/stackoverflow-related-questions-algorithm – Tyler Apr 24 '13 at 15:56

2 Answers2

6

There is a whole branch of Machine Learning called clustering (a type of unsupervised learning) that deals with such types of problems.

The question becomes a part of a cluster, and other questions in the same cluster (probably in the order of similarity measure distance) are displayed as similar questions.

There are various features that it can use for clustering, some of which may be:

  • Tags
  • Words in heading
  • Words in the text (lesser weight than heading)
  • Links to other questions/webpages.

and so on.

There may be other formulated features using techniques like text summarization, sentiment analysis, etc., that are used in these kind of problems. Which features are good for which problem depends on the problem.

Other areas where you see these algorithms in action are:

  • Youtube
  • Wikipedia
  • IMDB

and the list continues to infinity.

So what can you do about your problem?

There is no one answer for it. It all depends on your data, and target query. But still, you can

  • Learn feature engineering aspects of machine learning.
  • Learn about clustering.

(There are many online courses for these.)

Or

  • Hire a person who knows this stuff.
Sailesh
  • 25,517
  • 4
  • 34
  • 47
1

Most likley a weighted match on tags and perhaps a match() or equivilent full text weighted search on title.

Its probably got details of it in meta somewhere or FAQ

Dave
  • 3,280
  • 2
  • 22
  • 40