
Does anyone know of a good way to calculate the "semantic distance" between two words?

Immediately an algorithm that counts the steps between words in a thesaurus springs to mind.


OK, it looks like a similar question has already been answered: "Is there an algorithm that tells the semantic similarity of two phrases".

Ben Aston
  • For most arbitrary word pairs, there will be no "Kevin Bacon" path between them, and this distance will be infinite. Is that what you want? – Die in Sente Dec 30 '08 at 00:43
  • I was considering such an algorithm to cluster user reputation in "domains" automatically, so users achieve higher permissions on a domain-by-domain basis. E.g. a user might be an expert on "sailing", so my system would give that user more permissions on sailing-related items. – Ben Aston Dec 30 '08 at 13:20

3 Answers


In text mining there is an important maxim (due to the linguist J. R. Firth): "You shall know a word by the company it keeps." It means that it is possible to learn the meaning of a word from the terms that frequently appear close to it.

Without going into extensive detail, let me give two simple options for estimating the semantic distance between terms:

  1. Use a resource similar to WordNet (a large lexical database of English). WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. The semantic distance between two words can then be estimated as the length of the shortest path connecting them in that graph.

  2. Use a large corpus (e.g. Wikipedia) and count the terms that appear close to the words you are analyzing. Create two vectors and compute a distance between them (e.g. cosine similarity). Both options are sketched in the code below.
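
A minimal sketch of both options in Python (not from the answer; it assumes NLTK is installed with its WordNet corpus downloaded via nltk.download('wordnet'), and the toy corpus, tokenization, and window size are placeholders you would replace with real data such as a Wikipedia dump):

    from collections import Counter
    from math import sqrt

    from nltk.corpus import wordnet as wn

    # Option 1: WordNet path-based similarity. 1.0 means the words share a
    # synset; smaller values mean a longer path through the hierarchy.
    def wordnet_similarity(word1, word2):
        best = 0.0
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
        return best

    # Option 2: co-occurrence vectors built from a corpus, compared with cosine.
    def context_vector(word, sentences, window=2):
        vec = Counter()
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok == word:
                    vec.update(t for t in tokens[max(0, i - window):i + window + 1]
                               if t != word)
        return vec

    def cosine_similarity(v1, v2):
        dot = sum(count * v2[term] for term, count in v1.items())
        norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    corpus = [["the", "boat", "sailed", "across", "the", "bay"],
              ["we", "went", "sailing", "in", "a", "small", "boat"]]
    print(wordnet_similarity("boat", "ship"))
    print(cosine_similarity(context_vector("boat", corpus),
                            context_vector("sailing", corpus)))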

You can check these materials to get a better picture of the subject:

  1. http://www.saifmohammad.com/WebDocs/Mohammad_Saif_Thesis-slides.pdf

  2. http://www.umiacs.umd.edu/~saif/WebDocs/distributionalmeasures.pdf

  3. http://www.umiacs.umd.edu/~saif/WebDocs/Measuring-Semantic-Distance.pdf

mariolpantunes
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – John Doyle Mar 06 '14 at 15:20

The thesaurus idea has some merit. One idea would be to create a graph based on a thesaurus, with the nodes being the words and an edge indicating that they are listed as synonyms in the thesaurus. You could then use a shortest-path algorithm to give you the distance between the nodes as a measure of their similarity.

One difficulty here is that some words have different meanings in different contexts. Your algorithm may need to take this into account and use directed links with the weight of the outgoing link dependent on the incoming link being followed (or ignore some outgoing links based on the incoming link).
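
A minimal sketch of the basic, undirected version of this idea, ignoring the word-sense issue described above; synonyms_of is a hypothetical lookup that returns the thesaurus entry for a word, to be replaced with whatever thesaurus data you actually have:

    from collections import deque

    def build_graph(words, synonyms_of):
        # Undirected graph: an edge joins two words that the thesaurus
        # lists as synonyms of one another.
        graph = {w: set() for w in words}
        for w in words:
            for s in synonyms_of(w):
                if s in graph:
                    graph[w].add(s)
                    graph[s].add(w)
        return graph

    def semantic_distance(graph, start, goal):
        # Breadth-first search: returns the number of synonym hops between
        # start and goal, or None when no path exists (the "infinite
        # distance" case raised in the comments on the question).
        if start == goal:
            return 0
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            word, dist = queue.popleft()
            for nxt in graph.get(word, ()):
                if nxt == goal:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None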

tvanfosson
  • Thanks. Yeah, it's a tricky problem, but with your expansion on my thesaurus idea and by focussing on a subset of the language (e.g. just nouns) then intuitively it sounds possible. I don't have the time to implement such a system right now though. – Ben Aston Dec 30 '08 at 13:40
  • Thesauri don't really form graphs, though. Each entry is a "synset" - a set of synonyms, where all words in the set have equivalent meanings. If a word appears in multiple synsets, it's because that word has multiple meanings - so it's not very useful to draw an edge between the two synsets. – Nick Johnson May 31 '10 at 22:08
  • @Nick - that's not really my area of expertise, but I can see that it would be difficult to construct an accurate graph since words within the entry itself may be closer or further from the target based on semantics. Perhaps one could use multiple thesauri, adding 1 for each thesaurus that contains the pair in a synset. – tvanfosson May 31 '10 at 22:18
  • What I mean to say is that when the same set of characters ('word') appears in two different synsets, it's not really the same word - it's a different one spelled the same way, or at the very least a different sense. For example, the "mine" in ["mine", "deposit", "supply"] is not the same word as "mine" in ["mine", "dig up"] and not the same as the one in ["mine", "yours"] - so it doesn't make sense to have an edge between them. Without edges between synsets, you just have a large set of small, disjoint graphs. – Nick Johnson May 31 '10 at 23:23
  • @Nick, again not an expert, but aren't they typically grouped by meaning? Couldn't you use the common words between sets to identify how to choose which set from a word to use in creating the graph? You'd have to identify a word/meaning pair and link those, not merely the words. – tvanfosson Jun 01 '10 at 03:18
  • Right, they're grouped by meaning - so two instances of the same word in different synsets mean different things. You could find synsets that share more than one word in common, but you'd still be conflating different meanings if you drew a link between them. – Nick Johnson Jun 01 '10 at 07:26

Possible hack: send the two words together to a Google search and use the number of pages found.
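
One way to turn such counts into an actual distance (not part of the original suggestion, just a closely related formula) is the Normalized Google Distance of Cilibrasi and Vitányi. The sketch below assumes you have already obtained the hit counts for each word and for the pair by some means; the numbers in the example are made up:

    from math import log

    def normalized_google_distance(hits_x, hits_y, hits_xy, total_pages):
        # 0 means the two words always occur on the same pages; larger
        # values mean the words are less related.
        fx, fy, fxy, n = log(hits_x), log(hits_y), log(hits_xy), log(total_pages)
        return (max(fx, fy) - fxy) / (n - min(fx, fy))

    # Hypothetical hit counts for "sailing", "boat", and the pair together:
    print(normalized_google_distance(1.2e8, 8.0e8, 6.0e7, 2.5e10))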

Die in Sente
  • @Ben - Essentially, what this does is count the number of documents that the words have in common. For words that are highly selective this might have some merit, but for words that aren't good document discriminators, it's possible that you may get a zero correlation for words that are very closely related. – tvanfosson May 31 '10 at 13:27