Below is a code snippet that scores test documents against a query using TF-IDF in scikit-learn.
How do I get the top 5 vocabulary elements for each row in x_test_tfidf, along with their scores?
I know count_vect.get_feature_names (get_feature_names_out in scikit-learn 1.0 and later) returns the words corresponding to each column, but I don't know how to 1) get the 5 largest columns per row (something like this?), and 2) map the feature names to those columns (perhaps by setting an index?).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
df = pd.DataFrame({'text':[
'this is sentence one, about one thing',
'this is sentence two, about another thing',
'this is sentence three, about a third thing',
'this is sentence four, about a fourth thing']})
train, test = train_test_split(df, test_size=0.5, random_state=42)
# Transform words (unigrams and bigrams) via tfidf
# See https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
# See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vect = CountVectorizer(ngram_range=(1, 2))
tfidf_transformer = TfidfTransformer()
x_train_counts = count_vect.fit_transform(train['text'])
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
# Get the test matrix using the trained tf-idf numbers
x_test_counts = count_vect.transform(test['text'])
x_test_tfidf = tfidf_transformer.transform(x_test_counts)
# Produce tfidf scores for query_text
query_text = 'what about another thing'
query_text_df = pd.DataFrame({'text': [query_text]})
query_text_counts = count_vect.transform(query_text_df['text'])
query_text_tfidf = tfidf_transformer.transform(query_text_counts)
# Produce scores that match test set with query_text
scores = x_test_tfidf * query_text_tfidf.T
print(scores)
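A note on what scores holds: TfidfTransformer L2-normalizes each row by default (norm='l2'), so the product of the test matrix with the transposed query vector is the cosine similarity between each test document and the query. The sketch below checks that with plain NumPy; the vectors are made-up illustrative values, not the matrices produced above, and l2_normalize is a hypothetical helper.

```python
import numpy as np

# Made-up tf-idf-like vectors (illustrative values, not from the data above)
docs = np.array([[1.0, 2.0, 0.0],
                 [0.0, 1.0, 1.0]])
query = np.array([[1.0, 1.0, 0.0]])


def l2_normalize(m):
    """Scale every row of m to unit Euclidean length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


docs_n = l2_normalize(docs)
query_n = l2_normalize(query)

# Dot product of unit-length rows; shape (n_docs, 1), like `scores` above
dot_scores = docs_n @ query_n.T

# Cosine similarity computed directly from the raw vectors, for comparison
cosine = (docs @ query.T) / (
    np.linalg.norm(docs, axis=1, keepdims=True) * np.linalg.norm(query))

print(np.allclose(dot_scores, cosine))
```

So scores gives one overall similarity per test document, while the per-word breakdown asked about here needs the elementwise product instead.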
The desired outcome would be something like:
[[('about', 0.6), ('another', 0.6), ('thing', 0.4)],
[('about', 0.6), ('thing', 0.4)]]
because those are the words in each of the two test rows that also appear in query_text.
EDIT: Below is a partial answer, but it lacks the "top 5" functionality and the output looks messy.
Perhaps a clean top 5 result should be in "long" form, i.e. one output row per cell of the matrix (one document/term pair per row).
result = pd.DataFrame(
    data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
    # get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead
    columns=count_vect.get_feature_names_out())
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None):
    print(result)
The output:
      about  about one  about third   is  is sentence  one  one about  \
0  0.267261        0.0          0.0  0.0          0.0  0.0        0.0
1  0.316228        0.0          0.0  0.0          0.0  0.0        0.0

   one thing  sentence  sentence one  sentence three     thing  third  \
0        0.0       0.0           0.0             0.0  0.267261    0.0
1        0.0       0.0           0.0             0.0  0.316228    0.0

   third thing  this  this is  three  three about
0          0.0   0.0      0.0    0.0          0.0
1          0.0   0.0      0.0    0.0          0.0
EDIT 2: Found the rest of the answer here, and wrote it up as an answer.
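For reference, here is a minimal sketch of one possible way to get the top k terms per row in long form. It is not necessarily the approach from the linked answer. The helper top_k_long and the toy data are hypothetical names and made-up values for illustration; the function only assumes a SciPy sparse matrix and a matching list of feature names.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def top_k_long(matrix, feature_names, k=5):
    """Return a long-form DataFrame (row, term, score) holding, for each row
    of a sparse matrix, at most k positive scores, highest first."""
    matrix = csr_matrix(matrix)
    names = np.asarray(feature_names, dtype=object)
    records = []
    for i in range(matrix.shape[0]):
        # Column ids and values of the entries stored for row i
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        cols = matrix.indices[start:end]
        vals = matrix.data[start:end]
        keep = vals > 0  # drop any explicitly stored zeros
        cols, vals = cols[keep], vals[keep]
        # Positions of the k largest values, in descending order
        order = np.argsort(-vals, kind="stable")[:k]
        for j in order:
            records.append((i, names[cols[j]], float(vals[j])))
    return pd.DataFrame(records, columns=["row", "term", "score"])


# Made-up scores for illustration only (not output of the code above)
toy_names = ["about", "another", "one", "thing"]
toy_scores = csr_matrix(np.array([[0.6, 0.5, 0.0, 0.4],
                                  [0.3, 0.0, 0.0, 0.7]]))
top_terms = top_k_long(toy_scores, toy_names, k=2)
print(top_terms)
```

With the objects from the snippet above, the call would presumably be top_k_long(x_test_tfidf.multiply(query_text_tfidf), count_vect.get_feature_names_out(), k=5). Rows with fewer than k matching words simply yield fewer output rows, which matches the desired outcome shown earlier.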