
Okay, so I have been following these two posts on TF*IDF but am a little confused: http://css.dzone.com/articles/machine-learning-text-feature

Basically, I want to create a search query that searches through multiple documents. I would like to use the scikit-learn toolkit as well as the NLTK library for Python.

The problem is that I don't see where the two TF*IDF vectors come from. I need one search query and multiple documents to search. I figured I would calculate the TF*IDF score of each document against the query, find the cosine similarity between them, and then rank the documents by sorting the scores in descending order. However, the code doesn't seem to come up with the right vectors.

Whenever I reduce the query to a single search, it returns a huge list of 0's, which is really strange.
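To make the intended pipeline concrete, here is a minimal dependency-free sketch of the ranking scheme I have in mind (whitespace tokenization and a log-smoothed IDF are simplifications, not what scikit-learn does exactly):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF*IDF vectors (raw TF, log-smoothed IDF) for tokenized docs,
    all expressed over one shared vocabulary."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    return [[doc.count(t) * idf[t] for t in vocab] for doc in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

documents = ["the sky is blue", "the sun is bright"]
query = "the sun in the sky is bright"

vectors = tfidf_vectors([d.split() for d in documents + [query]])
query_vec, doc_vecs = vectors[-1], vectors[:-1]

# Rank documents by cosine similarity to the query, best first.
ranked = sorted(zip(documents, (cosine(query_vec, v) for v in doc_vecs)),
                key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print("%.3f  %s" % (score, doc))
```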

Here is the code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords

train_set = ("The sky is blue.", "The sun is bright.") #Documents
test_set = ("The sun in the sky is bright.") #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)

tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()
```
tabchas
  • I was wondering if you calculated the cosine using the final matrix that you get from print tfidf.todense(); if so, how do you do that? – add-semi-colons Aug 25 '12 at 02:05
  • 1
    Hey one sec... Ill post an example soon. – tabchas Aug 25 '12 at 17:23
  • Thanks, that would be fantastic. Would you be putting a link here? That's even better. – add-semi-colons Aug 25 '12 at 19:08
  • 2
    On my GitHub page here: https://github.com/tabchas -- The code is under Disco-Search-Query. I am trying to implement a search query for Wikipedia but right now there is a simpletfidf.py file which should be what you are looking for. – tabchas Aug 25 '12 at 19:28
  • Thanks, I am definitely going to look at it, because I am actually constructing documents based on some Google searches, so I still have a few questions. But here is what I did based on your initial code; I have given credit to your first answer with a link: http://stackoverflow.com/questions/12118720/python-tf-idf-cosine-to-find-document-similarity – add-semi-colons Aug 25 '12 at 19:39
  • Oh awesome... maybe we can collaborate on a similar project! We can maybe chat on Skype? Username: robomanager – tabchas Aug 25 '12 at 19:43
  • Yes, definitely; I have worked with Wikipedia data before. I am going to see if your implementation and mine give the same answer. – add-semi-colons Aug 25 '12 at 19:46
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/15808/discussion-between-tabchas-and-null-hypothesis) – tabchas Aug 25 '12 at 19:47

1 Answer


You're defining train_set and test_set as tuples, but I think they should be lists:

```python
train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
```

With this change, the code seems to run fine.
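The corrected script can also be condensed, and this sketch answers the follow-up question in the comments about cosine similarity. It assumes scikit-learn's TfidfVectorizer (which folds counting and TF*IDF weighting into one step) and cosine_similarity are available:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_set = ["The sky is blue.", "The sun is bright."]  # documents
test_set = ["The sun in the sky is bright."]            # query

# Learn the vocabulary and IDF weights from the documents only,
# then project the query into the same vector space.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(train_set)
query_vector = vectorizer.transform(test_set)

# One similarity score per document; argsort gives the ranking.
scores = cosine_similarity(query_vector, doc_vectors).flatten()
for i in scores.argsort()[::-1]:
    print("%.3f  %s" % (scores[i], train_set[i]))
```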

Sicco
  • Awesome, thanks for the advice. Any reason why it doesn't work with tuples? – tabchas Aug 11 '12 at 21:33
  • 2
    It is coded to take lists as input :). These lists are internally converted to NumPy arrays (you can also pass a NumPy array directly). – Sicco Aug 11 '12 at 21:49
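As a footnote on why the original version prints rows of zeros: in Python, parentheses around a single string do not create a tuple, so the original test_set is just a string, and vectorizer.transform then iterates it character by character, treating each character as a separate "document". A quick check:

```python
test_set_wrong = ("The sun in the sky is bright.")  # parentheses only: still a str
test_set_right = ["The sun in the sky is bright."]  # a one-element list

print(type(test_set_wrong).__name__)  # str
print(len(test_set_wrong))            # 29 -- iterating yields 29 characters
print(len(test_set_right))            # 1  -- one document, as intended

# A trailing comma, not the parentheses, is what makes a tuple:
print(("a",) == ("a"))                # False
```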