
I am looking for an algorithm that reports the percentage of similarity between two sentences. My idea was to store each sentence as a vector of chars and compare every char in one sentence against all the chars in the other. The number of matching characters divided by the total number of characters should then give me that percentage. If you know a faster, more efficient way of doing this, it would be greatly appreciated.

  • You are supposed to pick only one tag language. Your question will probably get closed for that – jhamon Apr 20 '18 at 13:53
  • Define "similar" - if someone uses "puppy" rather than "dog", do you punish them for all the letters by which "puppy" is offset? If someone uses "kill" rather than "hug", do you want to highlight the difference in meaning? – UKMonkey Apr 20 '18 at 13:59
  • It looks like you started thinking about an algorithm for calculating something before defining what that something is. First decide what *exactly* you mean by "percentage of similarity". – molbdnilo Apr 20 '18 at 14:06

2 Answers


What you are looking for might be something like the Vector Space Model [wiki link]. It is a common technique that web search engines use to find sites relevant to the strings users type in.

It is not the only algorithm that compares text and gives a value for similarity, but most of them are not overly complicated, and there are existing libraries that implement them efficiently, for example Xapian (C++) or Lucene (Java, with C++ ports such as CLucene). If you skim their docs you will likely find a way to score one piece of text against another and get back a scalar representation of their similarity (based on shared terms rather than on meaning).
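To give a feel for the idea without pulling in a library, here is a minimal sketch of a vector space comparison: each sentence becomes a word-count vector and the score is the cosine of the angle between the two vectors. The function names `termFrequencies` and `cosineSimilarity` are made up for this example, and the tokenizer (lowercased runs of ASCII letters and digits, no stemming, no stop words, no TF-IDF weighting) is a simplifying assumption, not what Xapian or Lucene actually do.

```cpp
#include <cctype>
#include <cmath>
#include <map>
#include <string>

// Hypothetical helper: count how often each lowercased word occurs.
// A "word" here is a run of ASCII letters/digits (a simplifying assumption).
std::map<std::string, int> termFrequencies(const std::string& text) {
    std::map<std::string, int> tf;
    std::string word;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            ++tf[word];
            word.clear();
        }
    }
    if (!word.empty()) ++tf[word];  // flush the last word
    return tf;
}

// Cosine similarity of the two word-count vectors, in the range [0, 1].
// Returns 0 if either sentence contains no words.
double cosineSimilarity(const std::string& a, const std::string& b) {
    const std::map<std::string, int> ta = termFrequencies(a);
    const std::map<std::string, int> tb = termFrequencies(b);

    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& entry : ta) {
        normA += static_cast<double>(entry.second) * entry.second;
        auto it = tb.find(entry.first);
        if (it != tb.end()) dot += static_cast<double>(entry.second) * it->second;
    }
    for (const auto& entry : tb) {
        normB += static_cast<double>(entry.second) * entry.second;
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Multiply the result by 100 to get a percentage. Two sentences with the same words in a different order score 1.0 with this approach, which may or may not be what you want.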

nada

You could use the Levenshtein distance to work out the similarity between the two strings. See https://en.m.wikipedia.org/wiki/Levenshtein_distance for more info.

Ian4264