I'm trying to build an algorithm capable of predicting if I will like an article, based on the previous articles I liked.
Example:
- I read 50 articles and liked 10 of them. I tell my program which ones I liked.
- Then 20 new articles come in. My program has to give me a "percentage of like" for each new article, based on the 10 I previously liked.
I found a lead here: Python: tf-idf-cosine: to find document similarity
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
And then, to compare the first document of the dataset to the other documents in the dataset:
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
        0.04457106,  0.03293218])
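From what I understand, I could then pull out the documents most similar to the first one like this (just a sketch of my understanding, not sure it is the standard way):
>>> # indices of the 5 documents most similar to the first one, skipping index 0 (the document itself)
>>> related_docs_indices = cosine_similarities.argsort()[::-1][1:6]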
For my case, what I think I will do is concatenate the text of my 10 liked articles, run TfidfVectorizer on the result, and then compare that one big vector to each new article that comes in.
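In code, my plan would look roughly like this (a sketch only; liked_texts and new_articles are placeholder names I am making up for my data):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.metrics.pairwise import linear_kernel
>>> liked_texts = ["text of liked article 1", "text of liked article 2"]   # my 10 liked articles would go here
>>> new_articles = ["text of new article 1", "text of new article 2"]      # the 20 incoming articles
>>> big_doc = " ".join(liked_texts)                                        # one big concatenated document
>>> tfidf = TfidfVectorizer().fit_transform([big_doc] + new_articles)
>>> scores = linear_kernel(tfidf[0:1], tfidf[1:]).flatten()                # one similarity score per new article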
But I wonder how the comparison will be done:
- the big vector (10 articles) compared to the little one OR
- the little one compared to the big one
I don't know if I'm being clear, but in the first case 90% of the words in the big vector won't be in the little one.
So my question is: how exactly is the cosine similarity calculated in this situation? And do you see a better approach for my project?
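For reference, here is how I currently understand the cosine similarity computation, with made-up numbers (please correct me if this is off):
>>> import numpy as np
>>> a = np.array([0.2, 0.0, 0.7, 0.1])   # invented TF-IDF weights for the big document
>>> b = np.array([0.0, 0.5, 0.3, 0.0])   # invented weights for a new article, over the same vocabulary
>>> a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))   # words absent from one document contribute 0 to the dot product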