
I have a list of strings and several separate strings that look like this:

    mylist = ["the yam is sweet", "what is the best time to come", "who ate my food", "no empty food on the table", "what can I do to make you happy"]  # about 20k data
    myString1 = "Is yam a food"  # String can be longer than this
    myString2 = "should I give you a food"
    myString3 = "I am not happy"

I want to compare each of these strings to every string in my list and collect the percentage similarities in three separate lists. The end result would look like this:

    similar_string1 = [70, 0.5, 50, 55, 2]
    similar_string2 = [50, 0.5, 70, 85, 2]
    similar_string3 = [20, 15, 0, 5, 80]

So myString1 will be compared to each string in mylist to calculate the percentage similarity, and likewise for myString2 and myString3. The percentages are then collected in lists as shown above.

I read that one can use TF-IDF to vectorize mylist and the query strings, then use cosine similarity to compare them, but I have never worked on something like this before. I would appreciate any idea, process, or code that would help me get started.

Thanks

Artyom Vancyan
Eniola

1 Answer


A Python implementation of cosine similarity has already been discussed in "Calculate cosine similarity given 2 sentence strings".

You can take the `text_to_vector` and `get_cosine` functions from that answer and use them with the snippet below:

```python
# text_to_vector() and get_cosine() are the helpers defined in the linked answer.
vector1 = text_to_vector(myString1)
vector2 = text_to_vector(myString2)
vector3 = text_to_vector(myString3)
similar_string1 = []
similar_string2 = []
similar_string3 = []

for ele in mylist:
    # Vectorize each list entry once and reuse it for all three comparisons.
    vector = text_to_vector(ele)
    # get_cosine() returns a value in [0, 1]; multiply by 100 for a percentage.
    similar_string1.append(get_cosine(vector1, vector))
    similar_string2.append(get_cosine(vector2, vector))
    similar_string3.append(get_cosine(vector3, vector))

print(similar_string1)
print(similar_string2)
print(similar_string3)
```
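In case the linked answer is not at hand, here is a minimal sketch of what the two helpers might look like, using plain word counts as the vectors. The lowercasing step is my own assumption (so that "Is" and "is" match), and it may differ from the linked answer.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Bag-of-words vector: maps each lowercased word to its count."""
    return Counter(WORD.findall(text.lower()))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors, in [0, 1]."""
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in common)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    if not denominator:
        return 0.0  # at least one text had no words
    return numerator / denominator


# Two of the four words are shared, and both vectors have length 2.
print(get_cosine(text_to_vector("Is yam a food"),
                 text_to_vector("the yam is sweet")))  # 0.5
```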

The variable names are the same as in your question. This code can be optimized further to suit your requirements.

Let me know if anything is unclear.
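Since the question mentions TF-IDF, here is one possible sketch of that variant in pure Python. It vectorizes `mylist` only once and reports percentages. The function names (`build_idf`, `tfidf_vector`, `similarity_percentages`), the smoothed IDF formula, and the choice to drop query words that never occur in `mylist` are all assumptions of mine. For about 20k strings, a library such as scikit-learn (`TfidfVectorizer` together with `cosine_similarity`) would probably be a better fit.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def tokenize(text):
    return WORD.findall(text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency for every word in docs."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(tokenize(doc)))
    return {word: math.log((1 + n) / (1 + count)) + 1
            for word, count in df.items()}


def tfidf_vector(text, idf):
    """Unit-length TF-IDF vector; words not seen in the corpus are dropped."""
    counts = Counter(w for w in tokenize(text) if w in idf)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}


def similarity_percentages(query, doc_vectors, idf):
    """Cosine similarity, as a percentage, of query against each document."""
    q = tfidf_vector(query, idf)
    # Both vectors have unit length, so the dot product is the cosine.
    return [round(100 * sum(q[w] * d.get(w, 0.0) for w in q), 1)
            for d in doc_vectors]


mylist = ["the yam is sweet", "what is the best time to come",
          "who ate my food", "no empty food on the table",
          "what can I do to make you happy"]

idf = build_idf(mylist)
doc_vectors = [tfidf_vector(doc, idf) for doc in mylist]  # computed once

similar_string1 = similarity_percentages("Is yam a food", doc_vectors, idf)
similar_string2 = similarity_percentages("should I give you a food", doc_vectors, idf)
similar_string3 = similarity_percentages("I am not happy", doc_vectors, idf)

print(similar_string1)
print(similar_string2)
print(similar_string3)
```

Each result list has one entry per string in `mylist`, in the same order. A value of 0 means the two strings share no words known to the corpus.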

MrRaghav