
Below is a code snippet that scores test documents against a query text using TF-IDF in scikit-learn.

How do I get the top 5 vocabulary elements for each row in x_test_tfidf, and their scores?

I know count_vect.get_feature_names can get the words corresponding to each column, but I don't know how to 1) get the 5 largest columns per row (something like this?), and 2) map the feature names onto those columns (perhaps by setting an index?).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

df = pd.DataFrame({'text':[
    'this is sentence one, about one thing',
    'this is sentence two, about another thing',
    'this is sentence three, about a third thing',
    'this is sentence four, about a fourth thing']})
train, test = train_test_split(df, test_size=0.5, random_state=42)

# Transform words (unigrams and bigrams) via tfidf
# See https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
# See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vect = CountVectorizer(ngram_range=(1, 2))
tfidf_transformer = TfidfTransformer()

x_train_counts = count_vect.fit_transform(train['text'])
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

# Get the test matrix using the trained tf-idf numbers
x_test_counts = count_vect.transform(test['text'])
x_test_tfidf = tfidf_transformer.transform(x_test_counts)

# Produce tfidf scores for query_text
query_text = 'what about another thing'
query_text_df = pd.DataFrame({'text': [query_text]})
query_text_counts = count_vect.transform(query_text_df['text'])
query_text_tfidf = tfidf_transformer.transform(query_text_counts)

# Produce scores that match test set with query_text
scores = x_test_tfidf * query_text_tfidf.T
print(scores)

The desired outcome would be something like:

[[('about', 0.6), ('another', 0.6), ('thing', 0.4)],
[('about', 0.6), ('thing', 0.4)]]

because the two test rows had those words that matched the query_text.
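
For reference, a rough sketch of one generic way to get output in roughly that shape: densify the elementwise product row by row and argsort it, mapping column indices back to terms. It assumes the x_test_tfidf, query_text_tfidf and count_vect defined above (newer scikit-learn versions spell the vocabulary accessor get_feature_names_out).

import numpy as np

# Sketch: top-5 (term, score) pairs per test row, scored against the query
feature_names = np.array(count_vect.get_feature_names())
products = x_test_tfidf.multiply(query_text_tfidf).toarray()
top_k = 5
for row in products:
    top_idx = np.argsort(row)[::-1][:top_k]  # column indices, largest first
    top_idx = top_idx[row[top_idx] > 0]      # keep only terms that matched
    print(list(zip(feature_names[top_idx], row[top_idx])))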

EDIT: Below is a partial answer, but without the "top 5" functionality, and the output looks messy.

Perhaps, to get clean top-5 final results, the output should be in "long" form, i.e. one row per (document, term) cell (see the sketch after the output below).

result = pd.DataFrame(
    data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
    columns=count_vect.get_feature_names())

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None):
    print(result)

The output:

      about  about one  about third   is  is sentence  one  one about  \
0  0.267261        0.0          0.0  0.0          0.0  0.0        0.0
1  0.316228        0.0          0.0  0.0          0.0  0.0        0.0

   one thing  sentence  sentence one  sentence three     thing  third  \
0        0.0       0.0           0.0             0.0  0.267261    0.0
1        0.0       0.0           0.0             0.0  0.316228    0.0

   third thing  this  this is  three  three about
0          0.0   0.0      0.0    0.0          0.0
1          0.0   0.0      0.0    0.0          0.0
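
As a sketch of that long-form idea (assuming the result DataFrame from the snippet above), the wide table can be melted into one row per (document, term) cell and the top 5 scores per document kept with groupby().head(5):

# Sketch: reshape `result` into long form, then keep the top 5 scores per doc
long_form = (result.reset_index()
                   .melt(id_vars='index', var_name='term', value_name='score')
                   .rename(columns={'index': 'doc'}))
long_form = long_form[long_form['score'] > 0]
top5 = (long_form.sort_values('score', ascending=False)
                 .groupby('doc')
                 .head(5)
                 .sort_values(['doc', 'score'], ascending=[True, False]))
print(top5)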

EDIT 2: Found the rest of the answer here, and wrote it up as an answer.


1 Answer


This worked for me.

# Produce top words between search text and each test set text
# See also https://stackoverflow.com/a/40434047/34935
tmp = pd.DataFrame(data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
                   columns=count_vect.get_feature_names())

# Turn each row into a list of (term, score) tuples, sorted by descending score
tmp = tmp.apply(lambda row: sorted(zip(tmp.columns, row),
                                   key=lambda cv: -cv[1]), axis=1)

# Keep the 5 largest (term, score) tuples per row and format them as
# "<row index>|(term, score), (term, score), ..."
nlargest = 5
vals = []
for key, val in zip(tmp.index, tmp.values.tolist()):
    val_tuples = val[:nlargest]
    vals.append('%d|%s' % (key, ', '.join(
        [str(tup) for tup in val_tuples])))

test['top_keywords'] = vals
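
A slightly shorter variant of the same idea (my sketch, reusing the variables above) leans on Series.nlargest instead of sorting every column per row:

# Sketch: same selection via Series.nlargest; note it keeps zero scores when
# fewer than `nlargest` terms overlap with the query
wide = pd.DataFrame(data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
                    columns=count_vect.get_feature_names())
test['top_keywords'] = wide.apply(
    lambda row: list(row.nlargest(nlargest).items()), axis=1).values

It selects the same top-scoring terms (up to ties) without ordering the whole vocabulary for every row; the string formatting in the loop above is optional.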