I'm trying to compute the similarity between a set of queries and a set of results, one result per query. I would like to do this using tf-idf scores and cosine similarity. The problem is that I can't figure out how to generate a tf-idf matrix from two columns of a pandas DataFrame. Concatenating the two columns works, but it's awkward because I then have to keep track of which query belongs to which result. How would I go about calculating a tf-idf matrix for two columns at once? I'm using pandas and sklearn.

Here's the relevant code:

tf = TfidfVectorizer(analyzer='word', min_df = 0)
tfidf_matrix = tf.fit_transform(df_all['search_term'] + df_all['product_title']) # This line is the issue
feature_names = tf.get_feature_names() 

I'm trying to pass df_all['search_term'] and df_all['product_title'] as arguments to tf.fit_transform. This clearly does not work, since it just concatenates the strings, which does not let me compare the search_term to the product_title. Is there a better way of going about this?

maxymoo
David
  • You need to add a space, like this: `df_all['search_term'] + " " + df_all['product_title']`. Otherwise you might fuse the last word of the search term with the first word of the product title – maxymoo Apr 20 '16 at 02:20
  • Also, you don't need `analyzer='word'`, since this is the default value – maxymoo Apr 20 '16 at 02:21
  • That line in my code is not what I want, I would like the terms and products to be separate so that I can compute cosine similarity between the search and the product. – David Apr 20 '16 at 02:24
  • i know, i'm just saying that if you are trying to combine them together you need to add the space in, you will need this sometime in the future – maxymoo Apr 20 '16 at 02:26
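
To illustrate the point about the separator made in the comments above, here is a minimal sketch. It reuses the two-row example frame from the answer (an assumption, since the asker's real data is not shown), and the names `fused` and `spaced` are purely illustrative.

```python
import pandas as pd

# Example data borrowed from the answer; not the asker's real DataFrame.
df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Without a separator, the end of one column fuses with the start of the next.
fused = df_all['search_term'] + df_all['product_title']

# With a separator, the tokens stay distinct.
spaced = df_all['search_term'] + " " + df_all['product_title']

print(fused.tolist())   # ['hathat stand', 'catcat in hat']
print(spaced.tolist())  # ['hat hat stand', 'cat cat in hat']
```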

1 Answer

You've made a good start by just putting all the words together; a simple pipeline like this is often enough to produce good results. For more complex feature processing, you can build pipelines with `sklearn.pipeline` and `sklearn.preprocessing`. Here's how that would work for your data:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

df_all = pd.DataFrame({'search_term':['hat','cat'], 
                       'product_title':['hat stand','cat in hat']})

transformer = FeatureUnion([
                ('search_term_tfidf', 
                  Pipeline([('extract_field',
                              FunctionTransformer(lambda x: x['search_term'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())])),
                ('product_title_tfidf', 
                  Pipeline([('extract_field', 
                              FunctionTransformer(lambda x: x['product_title'], 
                                                  validate=False)),
                            ('tfidf', 
                              TfidfVectorizer())]))]) 

transformer.fit(df_all)

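# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
search_vocab = list(transformer.transformer_list[0][1].steps[1][1].get_feature_names_out())
product_vocab = list(transformer.transformer_list[1][1].steps[1][1].get_feature_names_out())
vocab = search_vocab + product_vocab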

print(vocab)
print(transformer.transform(df_all).toarray())

['cat', 'hat', 'cat', 'hat', 'in', 'stand']

[[ 0.          1.          0.          0.57973867  0.          0.81480247]
 [ 1.          0.          0.6316672   0.44943642  0.6316672   0.        ]]
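
For the cosine-similarity part of the question, one possible approach is sketched below. It is not the method used in this answer: it assumes you are willing to fit a single `TfidfVectorizer` on the text of both columns, so that the search and product matrices share the same columns and can be compared row by row. The `cosine_similarity` column name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df_all = pd.DataFrame({'search_term': ['hat', 'cat'],
                       'product_title': ['hat stand', 'cat in hat']})

# Fit one vocabulary on both columns so the two matrices have identical columns.
tf = TfidfVectorizer()
tf.fit(pd.concat([df_all['search_term'], df_all['product_title']]))

search_tfidf = tf.transform(df_all['search_term'])
product_tfidf = tf.transform(df_all['product_title'])

# TfidfVectorizer L2-normalises each row by default, so the row-wise
# dot product of the two matrices is the cosine similarity.
similarity = np.asarray(search_tfidf.multiply(product_tfidf).sum(axis=1)).ravel()

# Hypothetical column name, added only for illustration.
df_all['cosine_similarity'] = similarity
print(df_all)
```

Because each query is paired with its own result on the same row, this avoids the bookkeeping problem described in the question. Whether a shared vocabulary is appropriate depends on your data.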
maxymoo
  • Thanks for your help. I'm trying to figure this out, but I can't seem to figure out what it's returning. When I run it, I'm not getting a tfidf matrix, is it giving me something else? Also, is it supposed to be accessing df_all? It doesn't seem like it's being referenced at all... – David Apr 20 '16 at 18:43
  • I've added an example calculation to hopefully make things clearer. To be honest, I can't work out exactly which variant of tf-idf is being used; I think it might be using log-frequencies (even though the docs say it doesn't) – maxymoo Apr 21 '16 at 04:40
  • this guy's put some notes together which may clarify things https://github.com/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb – maxymoo Apr 21 '16 at 05:09
  • I would recommend using traditional functions over lambda, as lambda can cause unexpected behavior. See https://github.com/scikit-learn/scikit-learn/issues/9467 – Nick Morgan Apr 04 '19 at 23:02
  • I would recommend also using a transformer like this https://stackoverflow.com/a/52703546/7927776 – David Beauchemin Jul 10 '20 at 18:09
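
On the question raised in the comments about which tf-idf variant is in use: the sketch below reproduces the product_title columns of the answer's output by hand. It assumes the `TfidfVectorizer` defaults (raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row); the names `manual` and `sk` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['hat stand', 'cat in hat']
n_docs = len(docs)

# Vocabulary in sorted order: cat, hat, in, stand
# Raw term counts per document.
counts = np.array([[0, 1, 0, 1],
                   [1, 1, 1, 0]])

# Number of documents containing each term.
doc_freq = np.array([1, 2, 1, 1])

# Smoothed idf, assumed to match the default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then L2-normalise each row.
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(docs).toarray()
print(manual)
print(np.allclose(manual, sk))
```

The first row of `manual` should match the 0.57973867 and 0.81480247 values shown in the answer's output.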