This is my code:
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler

# Logistic regression trained with SGD ('log' was renamed to 'log_loss' in scikit-learn 1.1)
sgd_classifier = SGDClassifier(loss='log', penalty='elasticnet', max_iter=30, n_jobs=60,
                               alpha=1e-6, l1_ratio=0.7, class_weight='balanced', random_state=0)

# Character 4-grams bounded at word edges; the vocabulary is fixed on the training set
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=10)
X_train = vectorizer.fit_transform(X_text_train.ravel())
X_test = vectorizer.transform(X_text_test.ravel())
# len(vocabulary_) works on all versions (get_feature_names() was removed in scikit-learn 1.2)
print('TF-IDF number of features:', len(vectorizer.vocabulary_))

# Scale each feature to [-1, 1] without destroying sparsity
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print('Inputs shape:', X_train.shape)

sgd_classifier.fit(X_train, y_train)
y_predicted = sgd_classifier.predict(X_test)
y_predicted_prob = sgd_classifier.predict_proba(X_test)

# classes_trained is the list of labels seen during training
results_report = classification_report(y_test, y_predicted, labels=classes_trained,
                                       digits=2, output_dict=True)
df_results_report = pd.DataFrame.from_dict(results_report)
pd.set_option('display.max_rows', 300)
print(df_results_report.transpose())
X_text_train & X_text_test have shapes (2M, 2) and (100k, 2) respectively.
The first column contains descriptions of financial transactions, each roughly 5-15 words long. The second column is a categorical variable holding the name of the bank associated with the transaction.
I merge these two columns into one description, so X_text_train & X_text_test then have shapes (2M,) and (100k,) respectively.
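For clarity, the merge is plain string concatenation of the two columns, along these lines (a minimal sketch; X_text_train is assumed to be a NumPy array of strings):

import numpy as np

# Concatenate "<description> <bank name>" into a single string per row
X_text_train = np.array([desc + ' ' + bank for desc, bank in X_text_train])
X_text_test = np.array([desc + ' ' + bank for desc, bank in X_text_test])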
Then I apply TF-IDF, and the resulting matrices X_train & X_test have shapes (2M, 50k) and (100k, 50k) respectively.
What I observe is that when the second column contains an unseen value (i.e. a new bank name appears in the merged description), the SGDClassifier returns very different, seemingly random predictions compared to what it returns if I drop the second column with the bank names entirely.
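To make the observation concrete, this is the kind of check I run (a minimal sketch; the description and the bank names are made up, with 'SEENBANK' standing for a bank present in the training data and 'NEWBANK' for one that is not):

# Same description, once with a bank name seen during training, once with a new one
seen = scaler.transform(vectorizer.transform(['card payment invoice 123 SEENBANK']))
unseen = scaler.transform(vectorizer.transform(['card payment invoice 123 NEWBANK']))
print(sgd_classifier.predict_proba(seen))    # "reasonable" probabilities
print(sgd_classifier.predict_proba(unseen))  # very different, seemingly random ones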
The same occurs if I apply TF-IDF only to the descriptions and keep the bank names as a separate categorical variable, as in the sketch below.
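By the separate-categorical variant I mean something like this (a minimal sketch assuming the original two-column arrays; handle_unknown='ignore' maps an unseen bank to an all-zero one-hot row, mirroring what TF-IDF does with unseen tokens):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack

desc_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=10)
bank_encoder = OneHotEncoder(handle_unknown='ignore')  # unseen bank -> all-zero row

# TF-IDF on the description column, one-hot on the bank-name column, stacked side by side
X_train = hstack([desc_vectorizer.fit_transform(X_text_train[:, 0]),
                  bank_encoder.fit_transform(X_text_train[:, [1]])]).tocsr()
X_test = hstack([desc_vectorizer.transform(X_text_test[:, 0]),
                 bank_encoder.transform(X_text_test[:, [1]])]).tocsr()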
Why does this happen with SGDClassifier?
Is it that SGD in general cannot handle unseen values well because it converges in this stochastic way?
The interesting thing is that the TF-IDF vocabulary is fixed at fit time, so unseen values in the test set are essentially not taken into account in the features (i.e. all the respective features are just 0), yet SGD still breaks.
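That mechanism is easy to verify in isolation (a toy example with a word analyzer for readability; with char_wb the same applies per n-gram):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = TfidfVectorizer(analyzer='word')
toy.fit(['alpha beta', 'beta gamma'])   # vocabulary: alpha, beta, gamma
row = toy.transform(['delta epsilon'])  # no token is in the vocabulary
print(row.nnz)                          # 0 -> unseen tokens produce an all-zero row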
(I also posted this on scikit-learn's GitHub: https://github.com/scikit-learn/scikit-learn/issues/21906)