
I'm struggling to fit a CountVectorizer model on a text dataframe that I have. The dataframe contains four columns of relatively long text. For example:

Description     Comments         Summary           System Log
text text text  text text text   text text text    text text text

I created this function, which works well on each column separately, but I can't figure out how to do the same for the whole df at once:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    def vectorize(df):
        # Count term occurrences, then rescale the counts to TF-IDF weights
        vectorizer = CountVectorizer(max_features=1500, max_df=0.90, min_df=0.05)
        X = vectorizer.fit_transform(df).toarray()
        tfidfconverter = TfidfTransformer()
        X = tfidfconverter.fit_transform(X).toarray()
        df = pd.DataFrame(X, columns=vectorizer.get_feature_names())
        return df

The output that I'm looking to get is a df that will look something like this:

       able    above   abpwrk  accessor    according   action      activity    actual      without 
0       0.0     0.0     0.0     0.00000     0.0         0.000000    0.0         0.000000    0.000000    
1       0.0     0.0     0.0     0.07126     0.0         0.249390    0.0         0.000000    0.000000    

It works if I merge all the columns into a single column of text, but something tells me there must be a smarter solution. Any idea?

Nati
  • The vectorizer is meant to turn a single set of documents into a vectorized feature set. What you're describing, if I'm not mistaken, is making a single feature set out of multiple distinct documents (the different columns in each row). To do this, you would either need to vectorize each column of documents individually and combine the results, or combine the documents in each row into a single document and vectorize them together. – G. Anderson Dec 19 '19 at 17:26
  • Does this answer your question? [use Featureunion in scikit-learn to combine two pandas columns for tfidf](https://stackoverflow.com/questions/34710281/use-featureunion-in-scikit-learn-to-combine-two-pandas-columns-for-tfidf) – G. Anderson Dec 19 '19 at 17:27
  • When I did that to each column separately, I ended up with duplicated words. I'm looking for a method that will calculate the occurrences across the whole df. – Nati Dec 19 '19 at 18:11
  • Then, as in the linked answer, the answer is to concatenate each record into a single document and vectorize that combined document column. Something like `X = vectorizer.fit_transform(df['Description']+' '+df['Comments']+...)`. As to your question "there must be a smarter solution", combining the documents is really the way to go – G. Anderson Dec 19 '19 at 18:17
  • Yes... I guess.. Thanks a lot mate! – Nati Dec 19 '19 at 18:18
