
I know we can measure the "sameness" of two signals using cross-correlation, but how do we calculate the percentage of "sameness" between two texts?

For example, we have: 1. "The Legend of Awesome Dog" and 2. "Dog Awesome The legend of", which are 100% the same, just shuffled.

But when either is paired with 3. "Dog awesome number 9", the sameness with sentence 1 or 2 is only 40%.

Leon

1 Answer


You are looking for approximate string matching. There is a free add-on for Excel, developed by Microsoft, that creates a so-called fuzzy match. It uses the Jaccard index to determine the similarity of two given values.

  • Make sure that both columns are formatted as tables (Ctrl+L);
  • Link the columns in the 'Left Columns' and the 'Right Columns' section and press the connect button in the middle;
  • Select which columns you want as output (hold Ctrl if you want to select multiple columns on either the left or the right side);
  • Make sure the FuzzyLookup.Similarity is checked;
  • Determine the maximum number of matches shown per comparable string;
  • Determine your Threshold. The number represents the minimum similarity two strings must have before they are marked as a match;
  • Go to cell A1 on a new sheet;
  • Hit the 'Go' button!
  • Select all the similarity scores and give them more decimals for a proper result.

See example.
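If you would rather compute this outside Excel, here is a minimal Python sketch of a word-level Jaccard index. This is an assumption about how to tokenize (lowercased words), not the add-on's actual implementation, and the helper names `tokens`, `jaccard`, `containment` and `fuzzy_matches` are made up for illustration. Note that Jaccard divides by the union of both word sets, so sentence 1 against sentence 3 gives 2/7; the 40% in the question corresponds to dividing by the words of sentence 1 only, which is what `containment` does.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard index: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def containment(a, b):
    """Fraction of a's tokens that also occur in b."""
    ta, tb = tokens(a), tokens(b)
    if not ta:
        return 0.0
    return len(ta & tb) / len(ta)


def fuzzy_matches(query, candidates, threshold=0.5, max_matches=1):
    """Return up to max_matches (score, candidate) pairs at or above threshold."""
    scored = [(jaccard(query, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:max_matches]


s1 = "The Legend of Awesome Dog"
s2 = "Dog Awesome The legend of"
s3 = "Dog awesome number 9"

print(f"{jaccard(s1, s2):.2f}")      # 1.00, same words in a different order
print(f"{jaccard(s1, s3):.2f}")      # 0.29, 2 shared words out of 7 distinct
print(f"{containment(s1, s3):.2f}")  # 0.40, 2 of the 5 words in sentence 1
```

The `threshold` and `max_matches` parameters play the same role as the Threshold and maximum number of matches settings in the steps above.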

marcuse