I have a dataframe like this:
     Text                                      A  B  C  Label
337  nobodi can explain gave what we did ...   0  1  0  1
338  provide an example                        1  1  0  0
339  another one????                           1  0  0  1
I would like to understand how to build an ML classifier on it. So far, I have done the following:
X = train[['Text','A','B','C']]
y = train['Label']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
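# Carve the validation set out of the training portion only, so it cannot overlap the test set
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=40)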
# Recombine features and labels into one dataframe per split
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
valid_df = pd.concat([X_valid, y_valid], axis=1)
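A quick sanity check for a three-way split like this is to confirm that the train, validation and test indices do not overlap. Below is a minimal sketch on a made-up toy frame. The column names mirror mine, but the values and the `X_rest`/`y_rest` names are only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real `train` dataframe (values are invented for illustration)
train = pd.DataFrame({
    "Text": [f"sample text {i}" for i in range(20)],
    "A": [i % 2 for i in range(20)],
    "B": [(i // 2) % 2 for i in range(20)],
    "C": [0] * 20,
    "Label": [i % 2 for i in range(20)],
})

X = train[["Text", "A", "B", "C"]]
y = train["Label"]

# First split: hold out 20% of the rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
# Second split: take the validation set from the remaining rows only
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=40
)

# Collect any row index that appears in more than one split
train_idx, valid_idx, test_idx = set(X_train.index), set(X_valid.index), set(X_test.index)
overlap = (train_idx & valid_idx) | (train_idx & test_idx) | (valid_idx & test_idx)
print(len(X_train), len(X_valid), len(X_test), len(overlap))
```

If `overlap` is non-empty, some rows are shared between splits and the accuracy measured on the test set would be optimistic.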
Then I create bag-of-words (BOW) and TF-IDF features from the Text column:
countV = CountVectorizer()
train_count = countV.fit_transform(train_df['Text'].values)
# To create tfidf frequency features
tfidfV = TfidfTransformer()
train_tfidf = tfidfV.fit_transform(train_count)
tfidf_ngram = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), use_idf=True, smooth_idf=True)
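As far as I understand, `CountVectorizer` followed by `TfidfTransformer` should give the same matrix as a single `TfidfVectorizer` with the same settings. A small sketch to check this, using default parameters and made-up sentences (`docs` is just an illustrative list, not my real data):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Invented example sentences, only for illustration
docs = ["provide an example", "another one", "provide another example please"]

# Two-step route: raw token counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two results as dense arrays
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Either way, both routes only ever see the text, so neither of them knows about the columns A, B and C.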
However, when I build a model, for example a Naive Bayes (NB) model:
nb_pipeline = Pipeline([
('NBCV', countV),
('nb_clf',MultinomialNB())])
nb_pipeline.fit(train_df['Text'],train_df['Label'])
predicted_nb = nb_pipeline.predict(test_df['Text'])
np.mean(predicted_nb == test_df['Label'])
Something does not work, as I lose the information in my dummy variables A, B and C: the model only has features from Text. I can see this when I look at the feature importances:
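# get_feature_names() was replaced by get_feature_names_out() in recent scikit-learn releases
feature_names = nb_pipeline.named_steps["NBCV"].get_feature_names_out()
# MultinomialNB.coef_ is no longer available in recent releases; for a binary problem
# it mirrored the log-probabilities of the second class, i.e. feature_log_prob_[1]
coefs = nb_pipeline.named_steps["nb_clf"].feature_log_prob_[1]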
import pandas as pd
zipped = zip(feature_names, coefs)
df = pd.DataFrame(zipped, columns=["feature", "value"])
df["ABS"] = df["value"].abs()
df["colors"] = df["value"].apply(lambda x: "green" if x > 0 else "red")
df = df.sort_values("ABS", ascending=True)
Can you explain why I am losing this information and how I can keep my dummy variables in the model? Those variables should be very meaningful, so I cannot leave them out when building the model. I need to check the model's accuracy and see what impact those variables have on it.