
I have a dataframe like this:

       Text                                             A   B   C    Label
337 nobodi can explain gave what we did ...             0   1   0      1
338 provide an example                                  1   1   0      0
339 another one????                                     1   0   0      1

I would like to understand how to build an ML classifier. So far, I have done the following:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

X = train[['Text','A','B','C']]
y = train['Label']

# Split into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_train, X_valid, y_train, y_valid  = train_test_split(X, y, test_size=0.25, random_state=40) 
# Returning to one dataframe
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
valid_df = pd.concat([X_valid, y_valid], axis=1)

Then I create features using BOW and TFIDF:

countV = CountVectorizer()
train_count = countV.fit_transform(train_df['Text'].values)

# To create tfidf frequency features

tfidfV = TfidfTransformer()
train_tfidf = tfidfV.fit_transform(train_count)

tfidf_ngram = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), use_idf=True, smooth_idf=True)

However, when I build the models, for example a NB model:

nb_pipeline = Pipeline([
        ('NBCV', countV),
        ('nb_clf',MultinomialNB())])

nb_pipeline.fit(train_df['Text'],train_df['Label'])
predicted_nb = nb_pipeline.predict(test_df['Text'])
np.mean(predicted_nb == test_df['Label'])

Something does not work: I lose the information from my dummy variables A, B, C, and I only get features from Text. I can check this when I try to look at feature importance:

feature_names = nb_pipeline.named_steps["NBCV"].get_feature_names()
coefs = nb_pipeline.named_steps["nb_clf"].coef_.flatten()

import pandas as pd
zipped = zip(feature_names, coefs)
df = pd.DataFrame(zipped, columns=["feature", "value"])
df["ABS"] = df["value"].apply(lambda x: abs(x))
df["colors"] = df["value"].apply(lambda x: "green" if x > 0 else "red")
df = df.sort_values("ABS", ascending=True)

Can you explain why I am losing this information, and how I can keep my dummy variables in the model? Those variables should be very meaningful for the model, so I cannot exclude them from the model build. I need to check the accuracy of the model and see the impact of those variables on it.

Math
  • Do you want to use `pipeline` for this? Or would you accept an answer without it? Additionally, you should never mix your `validation` and `test` sets. – artemis May 11 '21 at 13:05
  • Hi wundermahn. Thank you for your comment on the validation and test sets. If it would be possible to show all the steps in the answer (including the pipeline and the 'proof' that the features were included in the feature selection), that would be great. – Math May 11 '21 at 13:10
  • OK, give me a second. Note, I am creating my own data since I only have a few rows of yours, but most should be copy-paste. – artemis May 11 '21 at 13:10
  • Thank you @wundermahn. If you could also explain what I have been doing wrong, it would be extremely helpful for better understanding and for not repeating the same mistakes in the future. – Math May 11 '21 at 13:15

1 Answer


There are a few things to unpack here, so let's walk through them:

Validation, Test, Training data

Firstly, never mix your validation and testing data. This article provides more than five quotes from various academic textbooks and leading practitioners, such as Max Kuhn, that specifically highlight why you need to give the model completely untouched, unseen data for the final evaluation. I would suggest reading it.
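
For reference, here is a minimal sketch of a leakage-free three-way split, reusing the variable and column names from the question (an illustrative sketch, not code from the linked article): the test set is carved off first and never reused, and the validation set is taken only from the remaining data.

from sklearn.model_selection import train_test_split

X = train[['Text', 'A', 'B', 'C']]
y = train['Label']

# Hold out 20% as the final, untouched test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

# Split the remaining 80%: 25% of it (20% of the original data) becomes the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=40)

In the original code, both train_test_split calls started from the full X and y, so the validation and test sets overlapped.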

Getting the pipeline to work properly

In order to get your pipeline to work properly, you will need to use `make_column_transformer`, which has been available since scikit-learn 0.20. I tried to recycle as much code as I could from your example, so please note any subtle differences.

#!/usr/bin/env python
# coding: utf-8

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Added
from sklearn.compose import make_column_transformer

# Create some test data to mimic OP's example

bad_sentences = ["I hated this restauraunt!", "I do not like Stack Overflow", "I am sad about the economy", 
                 "I do not feel good about the new Python update", "I think I am allergic to hazelnut!", "I will never eat here",
                 "I am never coming back!", "I feel really sad after hearing the news", "George is getting upset!", "Python!"]
good_sentences = ["I really liked this place", "Stack Overflow is a great resource", "I am glad we are getting vaccinated",
                  "I love anime", "I miss you too!", "I am upset we don't hang out more, I love it!", 
                  "Why are we feeling sad? Be happy!", "Hi, my name is wundermahn, what is your name?", "Lol!", 
                  "I am the master of my domain"]

# Create things a bit more verbosely so OP understands what we are doing

sentences = bad_sentences+good_sentences

a_bad = [1 for i in range(len(bad_sentences))]
a_good = [0 for i in range(len(good_sentences))]
a = a_bad+a_good

b_bad = [1 for i in range(len(bad_sentences))]
b_good = [0 for i in range(len(good_sentences))]
b = b_bad+b_good

c_bad = [1 for i in range(len(bad_sentences))]
c_good = [0 for i in range(len(good_sentences))]
c = c_bad+c_good

label_bad = [1 for i in range(len(bad_sentences))]
label_good = [0 for i in range(len(good_sentences))]
label = label_bad+label_good

# Create dataframe
df = pd.DataFrame({'Text': sentences, 'A': a, 'B': b, 'C': c, 'label': label})

# NEVER mix your validation and testing data!
# https://stackoverflow.com/questions/28556942/pandas-remove-rows-at-random-without-shuffling-dataset

np.random.seed(38)
remove_n = 2
drop_indices = np.random.choice(df.index, remove_n, replace=False)
valid_df = df.iloc[drop_indices]
remaining_df = df.drop(drop_indices)

X_train, X_test, y_train, y_test = train_test_split(remaining_df[[col for col in remaining_df.columns if col != 'label']],
                                                    remaining_df['label'], test_size=0.2, random_state=38)

# Get CountVectorizer working as an example, you can add tfidf later on

# Now, create your pipeline which should include your vectorizer, as well as your model you plan on training
nb_pipeline = Pipeline([
                        ('vectorizer', make_column_transformer((CountVectorizer(), 'Text'), remainder='passthrough')),
                        ('classifier', MultinomialNB())
                      ])

# Now, we can effectively train our model using the proper feature set
nb_pipeline.fit(X_train, y_train)


# Now, get prediction
predicted_nb = nb_pipeline.predict(X_test)

# Print accuracy
print(np.mean(predicted_nb==y_test))

# Get feature names

# Note, we need to slightly edit how we get the names now that we are using a different transformation pipeline
# https://stackoverflow.com/questions/54646709/sklearn-pipeline-get-feature-names-after-onehotencode-in-columntransformer
feature_names = nb_pipeline['vectorizer'].transformers_[0][1].get_feature_names()
coefs = nb_pipeline.named_steps["classifier"].coef_.flatten()

# Your code
zipped = zip(feature_names, coefs)
features_df = pd.DataFrame(zipped, columns=["feature", "value"])
features_df["ABS"] = features_df["value"].apply(lambda x: abs(x))
features_df["colors"] = features_df["value"].apply(lambda x: "green" if x > 0 else "red")
features_df = features_df.sort_values("ABS", ascending=True)

# See results on validation set
valid_preds = nb_pipeline.predict(valid_df[['Text', 'A', 'B', 'C']])
print(np.mean(valid_preds==valid_df['label']))
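
As a side note, here is a sketch (an illustration built on the code above, not a required step) showing that the dummy columns really are part of the model's feature space: with remainder='passthrough', the untouched columns are appended after the vectorizer output, so you can rebuild the full feature-name list from the fitted ColumnTransformer and line it up with the Naive Bayes per-class log-probabilities. Keep in mind these are smoothed log P(feature | class) values, not signed importances.

# Sketch: list every feature the model sees, including the passthrough columns A, B, C
ct = nb_pipeline['vectorizer']                                    # fitted ColumnTransformer
text_features = ct.transformers_[0][1].get_feature_names()        # CountVectorizer vocabulary
passthrough_idx = ct.transformers_[-1][2]                         # indices of columns kept by remainder='passthrough'
passthrough_features = X_train.columns[passthrough_idx].tolist()  # ['A', 'B', 'C']

all_feature_names = list(text_features) + passthrough_features
log_probs = nb_pipeline.named_steps['classifier'].feature_log_prob_[1]  # log P(feature | class 1)

all_features_df = pd.DataFrame({'feature': all_feature_names, 'value': log_probs})
print(all_features_df[all_features_df['feature'].isin(['A', 'B', 'C'])])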

Please note, I am using: Python 3.8.8, sklearn 0.24.1, pandas 1.2.4, numpy 1.20.2
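
If you later want the TF-IDF n-gram features from your question instead of raw counts, one option (again only a sketch, using the same setup as above) is to swap the CountVectorizer for a TfidfVectorizer inside the same column transformer; the dummy columns still pass through unchanged:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    ('vectorizer', make_column_transformer(
        (TfidfVectorizer(stop_words='english', ngram_range=(1, 2)), 'Text'),
        remainder='passthrough')),
    ('classifier', MultinomialNB())
])

tfidf_pipeline.fit(X_train, y_train)
print(np.mean(tfidf_pipeline.predict(X_test) == y_test))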

artemis
  • 1. Basically using a random seed to randomly select 2 rows (`remove_n`) to use as a validation set. 2. Doesn't _quite_ make sense to me, especially in a naive bayes approach. Are you familiar with how naive bayes operates? If you are looking for generic feature importances, I really think you should consider looking into `shap`. – artemis May 11 '21 at 15:10
  • There are a lot of ways to do that -- suggest looking into [this book](https://christophm.github.io/interpretable-ml-book/). An easy way to do this would be using the package, `shap`. That is a model-agnostic approach to determining feature importances and is academically referenced frequently. – artemis May 11 '21 at 15:17
  • No problem. Feel free to "start a conversation" with me if you want/need to discuss `shap`, or to ask another question and tag me in a comment if you cannot get it working :) Happy ML-ing! – artemis May 11 '21 at 15:22
  • I checked the words listed for feature importance in the example you shared. Maybe I am wrong, but it seems that it counts and considers only words/features from the Text column, and not the values of A, B, and C. So it seems, though maybe I am wrong, that there is no contribution from those columns. They have boolean values (1/0) because they were converted using dummy features. Is it possible that the classifier is not considering, or just not looking at, the other three columns/features for prediction? – Math May 11 '21 at 19:41
  • No, it is looking at all features. What you are doing above isn't really feature importance, and you are only getting your names from the `vectorizer`, but the model is clearly looking at all features. I can write a test to prove that to you. – artemis May 12 '21 at 13:33
  • Thanks wundermahn. Trying a different approach, I got other features in the top 20 (the ones I would expect). I would be interested in feature importance for my model, as I would like to understand which variable contributes most to the model (looking either at all of them or at selected ones, like A, B, C). What is important for me is to have an output for feature importance. – Math May 12 '21 at 13:36
  • Sure. What you are doing isn't feature importance, though, and I _think_ that is kind of a different question than what is asked here. Have you asked a question related to retrieving feature importances for a `Naive Bayes` model? – artemis May 12 '21 at 13:38
  • We should not be using this comment thread to discuss how to find feature importances for a `Naive Bayes` model. They are two separate issues. You can open a new question for that, though it's mostly answered in this question: https://stackoverflow.com/questions/50526898/how-to-get-feature-importance-in-naive-bayes If you want a reusable method that will work across different algorithms, you will need to, again, look into something like SHAP or LIME. – artemis May 12 '21 at 16:17
  • You are right, thanks for the link. The only thing I am not understanding (and that has generated a lot of comments around NB and its feature importance), and that is related to my posted question, is how to see the importance of the dummy variables. This is unfortunately still missing, and I would need this information to see the importance of all the variables in my model, how they contribute, and whether they contribute at all. – Math May 12 '21 at 16:26
  • @Math that is a separate question. I think you have all of the components for a nicely thought out, minimal, reproducible example. Please post it as a separate question. – artemis May 12 '21 at 16:54
  • Sorry, wundermahn, but I respectfully disagree. It is true that you have provided a minimal, reproducible example, and I had marked it as the solution before testing it. But it is not actually answering my question, i.e., providing the whole list of features, including the dummy variables. My code had no errors; the only thing it was not doing was considering the dummy variables in the feature importance as well. It would be OK to use another classifier, as NB is problematic in this regard, but my question is about how to include and visualise the dummy variables. My code is not doing so. – Math May 12 '21 at 17:07
  • Sorry wundermahn, but from the beginning I wanted to see feature importance for my dummy variables. This would have been the difference between my code and others' code. It was not about validation and test (I was not using validation in my code, and this was not in scope). It was about seeing whether the dummy variables were contributing and how much they were contributing to my model. I am very sorry for all this misunderstanding. I appreciated your time and your answer, but it is not actually showing me how to visualize feature importance for all my variables. – Math May 12 '21 at 17:54
  • Quoting myself: `I would need to understand how to use both Text and dummy variables in ML for features selection purposes`; if you also look at my question, you can read more details on what was wrong in my code and what was missing: `Can you explain why I am losing this information and how I can keep my dummy variables in the model?`. Since we respectfully disagree about what I was asking, and there was clearly a misunderstanding, I will stop commenting here to avoid extending the thread. Again, I really appreciate your time in helping me. – Math May 12 '21 at 17:58
  • I awarded you the 100 points, but since it has not fully answered my original question, I had to unmark it. Thank you again for your time and help. I appreciated it. – Math May 13 '21 at 21:35