5

Let's suppose that I have a pandas dataframe with two columns which resembles the following one:

    text                                label
0   This restaurant was amazing         Positive
1   The food was served cold            Negative
2   The waiter was a bit rude           Negative
3   I love the view from its balcony    Positive

and then I am using `TfidfVectorizer` from `sklearn` on this dataset.

What is the most efficient way to find the top *n* vocabulary terms, in terms of TF-IDF score, per class?

Of course, my actual dataframe consists of many more rows of data than the 4 above.

The point of my post is to find code which works for any dataframe that resembles the one above, whether it is a 4-row dataframe or a 1M-row dataframe.
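For reference, the sample dataframe above can be constructed as follows:

import pandas as pd

df = pd.DataFrame({
    "text": ["This restaurant was amazing",
             "The food was served cold",
             "The waiter was a bit rude",
             "I love the view from its balcony"],
    "label": ["Positive", "Negative", "Negative", "Positive"],
})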

I think that my post is closely related to the following posts:

– Outcast
  • Unless you are removing hapaxes explicitly, the unique words in your input documents will have the highest score, by the TFxIDF definition. If you have more than a few dozen words, "top 3" will be meaningless because all the top *n* words will have the same highest score, and often not be particularly good indicators of anything at all. – tripleee Jun 21 '19 at 12:25
  • @tripleee, thank you for your comment. However, I think/thought that it was more than obvious that the dataframe is only a small sample dataframe. My actual dataframe consists of rows on the order of 100k. The point of my post is to find code which works for any dataframe like that, whether a 4-row dataframe or a 1M-row dataframe. The same applies to whether it should be top 3 or top 100 or top whatever. Therefore, let's please focus on the matter in question rather than making remarks which only state the obvious. – Outcast Jun 21 '19 at 14:01
  • But the (indeed, obvious) answer to your question is "anything which occurs in only a single sample". A more useful question would be e.g. "which tokens have a high DF (i.e. a low IDF) in one set but not the other", but you're not asking that, and we can't really guess from your post whether that's really what you actually want. – tripleee Jun 21 '19 at 14:48
  • Haha @tripleee, my question is not which vocabulary will in general be the top n (in terms of TF-IDF score) per class, because the answer is obvious and it is the one you state. My question is what code to use to efficiently find the top n (in terms of TF-IDF score) vocabulary per class with `sklearn` and `TfidfVectorizer`. So I need CODE, not obvious text answers. – Outcast Jun 21 '19 at 14:59
  • If I understood what you wanted, I would perhaps post an answer. These are comments to hopefully get you to clarify what you are trying to accomplish. So you are really not looking for the top 3 most polarized terms, for example? – tripleee Jun 21 '19 at 16:15
  • @tripleee you "would **perhaps** post an answer", as you say, so I do not really know what the point of this statement is. What I simply want is what the posts which I mention above do, but per class. See also the answers below from people who understood what I was saying without asking a single question. – Outcast Jun 21 '19 at 16:22

3 Answers

7

The following code should do the job (thanks to Mariia Havrylovych).

Assume we have an input dataframe, df, aligned with your structure.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Override scikit-learn's TfidfVectorizer so that it returns a dataframe
# with the feature names as columns instead of a sparse matrix
class DenseTfIdf(TfidfVectorizer):

    def transform(self, x, y=None) -> pd.DataFrame:
        res = super().transform(x)
        # keep the index of the input series so rows stay aligned with the original dataframe
        df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
        return df

    def fit_transform(self, x, y=None) -> pd.DataFrame:
        # run sklearn's fit_transform
        res = super().fit_transform(x, y=y)
        # convert the returned sparse documents-terms matrix into a dataframe for further manipulation
        # (on scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out())
        df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
        return df

Usage:

# assume texts are stored in column 'text' within a dataframe
texts = df['text']
df_docs_terms_corpus = DenseTfIdf(sublinear_tf=True,
                                  max_df=0.5,
                                  min_df=2,
                                  encoding='ascii',
                                  ngram_range=(1, 2),
                                  lowercase=True,
                                  max_features=1000,
                                  stop_words='english'
                                  ).fit_transform(texts)


# Keep the indexes of the original dataframe and of the documents-terms dataframe aligned
df_class = df[df["label"] == "Class XX"]  # e.g. "Positive"
# use .loc, not .iloc: the documents-terms dataframe preserves the original index labels
df_docs_terms_class = df_docs_terms_corpus.loc[df_class.index]
# sum by columns and get the top n keywords
df_docs_terms_class.sum(axis=0).nlargest(n=50)
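Putting it together for the question's sample dataframe (a minimal sketch; `top_n = 3` and the default vectorizer settings are my own choices, since the strict `min_df`/`max_df` values above would filter out everything on a 4-row toy example):

texts = df['text']
df_docs_terms_corpus = DenseTfIdf().fit_transform(texts)

top_n = 3
for label in df["label"].unique():
    # rows of the original dataframe belonging to this class
    class_index = df.index[df["label"] == label]
    # sum the tf-idf scores over the class's documents and take the top n terms
    top_terms = df_docs_terms_corpus.loc[class_index].sum(axis=0).nlargest(top_n)
    print(label, top_terms.to_dict())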
– Gilad Barkan
2

Below is a piece of code that I wrote more than three years ago for a similar purpose. I'm not sure whether this is the most efficient way of doing what you want to do, but as far as I remember, it worked for me.

import operator

import numpy as np

# X: documents-terms sparse matrix produced by the vectorizer
# y: targets (the data points' labels)
# vectorizer: TF-IDF vectorizer created by sklearn
# n: number of features that we want to list for each class
# target_list: the list of all unique labels (for example, in my case I have two labels: 1 and -1 and target_list = [1, -1])
# --------------------------------------------
# invert the vocabulary once: column index -> term
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}

# splitting X vectors based on target classes
for label in target_list:
    # finding the indices of the rows (data points) of the current class
    indices = [i for i in range(X.shape[0]) if y[i] == label]

    # get the rows of the current class from the tf-idf vectors matrix
    # and calculate the mean of the feature values
    vectors = np.asarray(X[indices, :].mean(axis=0)).ravel()

    # create a dictionary of features (column indices) with their corresponding mean values
    current_dict = {i: vectors[i] for i in range(X.shape[1])}

    # sort the dictionary by value, in decreasing order
    sorted_dict = sorted(current_dict.items(), key=operator.itemgetter(1), reverse=True)

    # print the rank and the textual and numeric values of the top n features
    for index, (feature_index, value) in enumerate(sorted_dict[:n], start=1):
        print(str(index) + "\t" + index_to_term[feature_index] + "\t" + str(value))
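For completeness, a minimal sketch of how the variables above could be set up (the sample data mirrors the question; the names `X`, `y`, `vectorizer`, `n` and `target_list` match the comments in the snippet):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["This restaurant was amazing",
         "The food was served cold",
         "The waiter was a bit rude",
         "I love the view from its balcony"]
y = ["Positive", "Negative", "Negative", "Positive"]
target_list = ["Positive", "Negative"]
n = 3

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
# ... then run the loop above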
– Pedram
  • Ok thank you (upvote)! That's a good start. I think that some other people may come up with a more efficient version of it. – Outcast Jun 21 '19 at 16:20
-1
# term_doc_mat: a dense documents-terms DataFrame (documents as rows, terms as columns)
top_terms = pd.DataFrame(columns=range(1, 6))

for i in term_doc_mat.index:
    top_terms.loc[len(top_terms)] = term_doc_mat.loc[i].sort_values(ascending=False)[0:5].index

This will give you the top 5 terms for each document. Adjust as needed.
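The row-by-row loop can get slow on large matrices; a vectorized variant (my own sketch, with the same assumed `term_doc_mat` dataframe) would be:

import numpy as np
import pandas as pd

# column positions of the 5 largest values in each row, in decreasing order
order = np.argsort(-term_doc_mat.to_numpy(), axis=1)[:, :5]
top_terms = pd.DataFrame(term_doc_mat.columns.to_numpy()[order],
                         index=term_doc_mat.index,
                         columns=range(1, 6))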

– hp2500