
A friend of mine had an idea to make a speed reading program that displays words one by one (much like currently existing speed reading programs). However, the program would filter out words that aren't completely necessary to the meaning (if you want to skim something).

I have started to implement this program, but I'm not quite sure what the algorithm for getting rid of "unimportant" words should be.

My idea is to parse the sentence (I'm currently using the Stanford Parser), assign each word a weight based on how important it is to the sentence's meaning, and then remove the word with the lowest weight. After each removal I will check how "different" the original parse tree and the new tree are, and keep removing the lowest-weighted word until the two trees become too different (the cutoff will be a constant determined via a "calibration" process that each user goes through once). Finally, I will go through each word of the shortened sentence and try to replace it with a simpler or shorter synonym (again, while still trying to retain the meaning).
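Here is a rough Python sketch of the loop I have in mind; `word_weight()` and `similarity()` are placeholder stand-ins for whatever weighting scheme and tree-comparison metric I end up with, and `threshold` stands in for the calibrated constant:

```python
# Rough sketch of the removal loop. word_weight() and similarity() are
# placeholders; real versions would use the parse tree, tf-idf, etc.

def word_weight(word):
    # Placeholder: very common function words get a low weight.
    common = {"the", "a", "an", "of", "to", "do", "you", "said"}
    return 0.1 if word.lower().strip(",.?!'\"") in common else 1.0

def similarity(original, shortened):
    # Placeholder for a real tree-comparison metric: here, just the
    # fraction of original words that survived.
    kept = set(shortened)
    return sum(1 for w in original if w in kept) / len(original)

def shorten(sentence, threshold=0.5):
    words = sentence.split()
    current = words[:]
    while len(current) > 1:
        # Tentatively remove the lowest-weighted remaining word.
        i = min(range(len(current)), key=lambda k: word_weight(current[k]))
        candidate = current[:i] + current[i + 1:]
        # Stop once the result is "too different" from the original.
        if similarity(words, candidate) < threshold:
            break
        current = candidate
    return " ".join(current)

print(shorten("Billy said to Jane, 'Do you want to go out?'"))
```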

There will also be special cases for very common words like "the," "a," and "of."
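For those special cases, NLTK's built-in stopword list might be enough as a starting point (this assumes the `stopwords` corpus has been downloaded with `nltk.download('stopwords')`):

```python
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def drop_common_words(words):
    # Unconditionally drop very common function words before weighting.
    return [w for w in words if w.lower() not in STOPWORDS]

print(drop_common_words(["Do", "you", "want", "to", "go", "out"]))
# drops "Do", "you", "to" (and possibly "out", which is also on NLTK's list)
```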

As a full example of the whole process:

"Billy said to Jane, 'Do you want to go out?'"

Would become:

"Billy told Jane 'want go out?'"

This would retain basically all of the meaning of the sentence while shortening it significantly.

Is this a good idea for an algorithm, and if so, how should I assign the weights, what tree comparison algorithm should I use, and is the synonym insertion done in a good place (i.e. should it be done before I try to remove any words)?

Dylan Siegler
  • Are you dead set on doing it all from scratch or are you ok using libraries eg NLTK or gensim etc? – Paul Rooney Nov 24 '16 at 14:19
  • I would definitely be open to using other libs like nltk – Dylan Siegler Nov 24 '16 at 14:20
  • There is [this](http://stackoverflow.com/questions/17022691/python-semantic-similarity-score-for-strings) and a linked duplicate. – Paul Rooney Nov 24 '16 at 14:23
  • @PaulRooney Would tf-idf and gensim consistently work well on two sentences that are very similar? I had thought these were for determining whether two differently structured sentences are similar? – Dylan Siegler Nov 24 '16 at 14:34

3 Answers


You can use the method described in this paper for computing the similarity of two sentences: Corpus-based and Knowledge-based Measures of Text Semantic Similarity

You can remove words until the similarity with the original sentence drops significantly (this is an interesting problem in itself).

You can also check a simplified version of the similarity algorithm here: Wordnet Sentence Similarity
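As a rough illustration of the simplified approach (this is a sketch, not the linked post's exact code; it assumes NLTK with the `wordnet` corpus downloaded):

```python
from nltk.corpus import wordnet as wn

def best_word_similarity(word, other_words):
    # Best path_similarity between any synset of `word` and any synset
    # of the other sentence's words; cross-POS pairs may return None.
    best = 0.0
    for s1 in wn.synsets(word):
        for other in other_words:
            for s2 in wn.synsets(other):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
    return best

def sentence_similarity(words1, words2):
    # Average each word's best match, in both directions.
    def one_way(a, b):
        return sum(best_word_similarity(w, b) for w in a) / len(a)
    return (one_way(words1, words2) + one_way(words2, words1)) / 2

original = "billy said to jane do you want to go out".split()
shortened = "billy told jane want go out".split()
print(sentence_similarity(original, shortened))
```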

bogs

Assigning weights is the million-dollar question here. As a first step, I would identify the parts of the sentence (subject, predicate, clause, etc.) and the sentence structure (simple, compound, complex, etc.) to find "anchor" words that would get the highest weight. That should make the rest of the task easier.
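As a rough starting point (not a full solution), you could approximate this with NLTK's POS tagger rather than a full parse; the tag-to-weight table below is just a guess to tune against real data (requires the `punkt` and `averaged_perceptron_tagger` NLTK data):

```python
import nltk

# Hand-picked guesses: content-word tags high, function-word tags low.
TAG_WEIGHTS = {
    "NN": 1.0, "NNS": 1.0, "NNP": 1.0, "NNPS": 1.0,  # nouns
    "VB": 0.9, "VBD": 0.9, "VBG": 0.9, "VBN": 0.9,
    "VBP": 0.9, "VBZ": 0.9,                          # verbs
    "JJ": 0.6, "RB": 0.4,                            # adjectives, adverbs
}

def anchor_weights(sentence):
    tokens = nltk.word_tokenize(sentence)
    # Anything not listed (DT, IN, TO, ...) defaults to a low weight.
    return [(w, TAG_WEIGHTS.get(tag, 0.1)) for w, tag in nltk.pos_tag(tokens)]

print(anchor_weights("Billy said to Jane, 'Do you want to go out?'"))
```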

postoronnim
  • Would "anchor" words be the words which are closest to the root of the tree of the sentence structure? – Dylan Siegler Nov 24 '16 at 19:17
  • That's where research comes in. However, intuitively, I would say there will be one main word in each part of the sentence, and some parts of the sentence will be more significant than others - for example, the subject's main word cannot be omitted. So yes - if you build your tree around those ideas, that should cut down on the amount of work the algorithm needs to do later on. Also, I think determining context early on would not be a bad idea, because the same words will have different weights depending on the context. – postoronnim Nov 24 '16 at 20:44

Assuming you use word embeddings as the weighting logic (I can't think of a better way to do it), you can convert phrases into vectors and compare those vectors. Low-weight words such as "a," "an," "the," etc. will be handled nicely this way.

This tutorial might help you: Phrase2Vec In Practice
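A minimal sketch of that idea with gensim (`vectors.bin` is a placeholder path; any pretrained word2vec-format model works):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: substitute any pretrained word2vec-format model,
# e.g. the Google News vectors.
model = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def phrase_vector(words):
    # Average the vectors of the in-vocabulary words.
    vecs = [model[w] for w in words if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = phrase_vector("billy said to jane do you want to go out".split())
v2 = phrase_vector("billy told jane want go out".split())
print(cosine(v1, v2))
```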

aerin