I am using python and scikit-learn to find the cosine similarity between two strings(specifically, names).The program is able to find the similarity score between two strings but, when strings are abbreviated, it shows some undesirable output.
e.g- String1 ="K KAPOOR",String2="L KAPOOR" The cosine similarity score of these strings is 1(maximum) while the two strings are entirely different names.Is there a way to modify it, in order to get some desired results.
My code is:
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 9 14:40:21 2015
@author: gauge
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents=("K KAPOOR","L KAPOOR")
tfidf_vectorizer=TfidfVectorizer()
tfidf_matrix=tfidf_vectorizer.fit_transform(documents)
#print tfidf_matrix.shape
cs=cosine_similarity(tfidf_matrix[0:1],tfidf_matrix)
print cs