
Is there some sort of algorithm out there or concept that can help with the following problem?

Say I have two snippets of text, snippet 1 and snippet 2.

Snippet 1 reads as follows:

"The dog was too scared to go out into the storm"

Snippet 2 reads as follows:

"The canine was intimidated to venture into the rainy weather"

Is there a way to compare those snippets using some sort of algorithm, or maybe some sort of string theory system? I want to know if there are any kinds of systems that have solved this problem before I tackle it.

UPDATE: Okay, to give a more specific example, say I wanted to reduce the number of bugs in a ticketing system by scanning for related or similar tickets. I want to know the best systematic way of determining what the issue is from the body of a ticket. The Levenshtein distance algorithm doesn't work particularly well here, since it compares characters rather than meaning: it wouldn't know the difference between wet and dry.

ddeamaral
  • This repository has a word-embedding approach to your question: https://bitbucket.org/yunazzang/aiwiththebest_byor – aerin Mar 19 '17 at 05:31

2 Answers


Is there a way to compare those snippets using some sort of algorithm, or maybe some sort of string theory system? I want to know if there are any kinds of systems that have solved this problem before I tackle it.

Well, this is a very famous problem in NLP. To be more precise, you are comparing the semantics of two sentences. You could look into libraries such as gensim or WordNet::Similarity, which provide ways to measure semantic similarity and retrieve semantically similar documents.

Here's another semantically similar SO question.
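
To make the idea concrete, here is a rough sketch of comparing meaning rather than characters. It is not how gensim or WordNet::Similarity work internally. The `CANONICAL` synonym table and the `STOPWORDS` set are hypothetical and hand-written just for the two snippets in the question; a real system would look words up in WordNet or compare word embeddings instead.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-written synonym table used only for this illustration.
# A real system would consult WordNet or word embeddings instead.
CANONICAL = {
    "canine": "dog",
    "intimidated": "scared",
    "venture": "go",
    "rainy": "storm",
    "weather": "storm",
}

# Hypothetical stopword list: common words that carry little meaning.
STOPWORDS = {"the", "was", "too", "to", "into", "out", "a", "an"}


def normalize(text):
    """Lowercase, tokenize, drop stopwords, and map synonyms to one canonical word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(CANONICAL.get(t, t) for t in tokens if t not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters, in [0, 1]."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = "The dog was too scared to go out into the storm"
s2 = "The canine was intimidated to venture into the rainy weather"

# Compare the raw surface words first, then the synonym-normalized words.
raw = cosine_similarity(Counter(s1.lower().split()), Counter(s2.lower().split()))
mapped = cosine_similarity(normalize(s1), normalize(s2))
print(f"surface words only: {raw:.2f}, after synonym mapping: {mapped:.2f}")
```

The score rises once synonyms are collapsed, which is the effect the libraries above give you without having to maintain a table by hand.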

nishantbhardwaj2002

An option here could be the Levenshtein distance between two strings. It is the minimum number of single-character edits (insertions, deletions, or substitutions) required to turn one string into the other. So, the larger the distance, the less similar the two strings.
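
For reference, the distance described above can be computed with the usual dynamic-programming approach. This is a minimal sketch, not a tuned implementation; the `similarity` helper is my own hypothetical addition that scales the distance by the longer string's length so that strings of different sizes can be compared.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: scale the distance into [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

You can call `levenshtein` and `similarity` on the two snippets from the question to see how they score.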

This kind of algorithm is great for spell checking or voice recognition because the given string and the expected string generally differ by only a couple of words or characters.

For your example, the Levenshtein distance is 32 (you can try this calculator), which indicates that the strings are not very similar, since the strings are not much longer than the distance of 32.

This algorithm is not great for context-sensitive comparisons, but your example is something of an extreme case. In practice the two texts would very likely have more words in common, which would result in a smaller Levenshtein distance. You could use this algorithm in conjunction with some other methods (see: What are some algorithms for comparing how similar two strings are?) to get a better comparison.

Justin Hellreich