This is my code:
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler

# Logistic regression trained with SGD ('log' was renamed to 'log_loss' in scikit-learn 1.1)
sgd_classifier = SGDClassifier(loss='log', penalty='elasticnet', max_iter=30, n_jobs=60,
                               alpha=1e-6, l1_ratio=0.7, class_weight='balanced', random_state=0)

# Character 4-grams bounded at word edges; the vocabulary is fixed on the training set
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=10)
X_train = vectorizer.fit_transform(X_text_train.ravel())
X_test = vectorizer.transform(X_text_test.ravel())
# len(vocabulary_) works on all versions (get_feature_names() was removed in scikit-learn 1.2)
print('TF-IDF number of features:', len(vectorizer.vocabulary_))

# Scale each feature to [-1, 1] without destroying sparsity
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print('Inputs shape:', X_train.shape)

sgd_classifier.fit(X_train, y_train)
y_predicted = sgd_classifier.predict(X_test)
y_predicted_prob = sgd_classifier.predict_proba(X_test)

# classes_trained is the list of labels seen during training
results_report = classification_report(y_test, y_predicted, labels=classes_trained,
                                       digits=2, output_dict=True)
df_results_report = pd.DataFrame.from_dict(results_report)
pd.set_option('display.max_rows', 300)
print(df_results_report.transpose())
X_text_train & X_text_test have shapes (2M, 2) and (100k, 2) respectively.
The first column contains descriptions of financial transactions, each roughly 5-15 words long. The second column is a categorical variable holding the name of the bank associated with the transaction.
I merge these two columns into one description, so X_text_train & X_text_test then have shapes (2M,) and (100k,) respectively.
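For clarity, the merge is plain string concatenation of the two columns, along these lines (a minimal sketch; X_text_train is assumed to be a NumPy array of strings):

import numpy as np

# Concatenate "<description> <bank name>" into a single string per row
X_text_train = np.array([desc + ' ' + bank for desc, bank in X_text_train])
X_text_test = np.array([desc + ' ' + bank for desc, bank in X_text_test])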
Then I apply TF-IDF, and the resulting matrices X_train & X_test have shapes (2M, 50k) and (100k, 50k) respectively.
What I observe is that when the second column contains an unseen value (i.e. a new bank name appears in the merged description), the SGDClassifier returns very different, seemingly random predictions compared to what it returns if I drop the second column with the bank names entirely.
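To make the observation concrete, this is the kind of check I run (a minimal sketch; the description and the bank names are made up, with 'SEENBANK' standing for a bank present in the training data and 'NEWBANK' for one that is not):

# Same description, once with a bank name seen during training, once with a new one
seen = scaler.transform(vectorizer.transform(['card payment invoice 123 SEENBANK']))
unseen = scaler.transform(vectorizer.transform(['card payment invoice 123 NEWBANK']))
print(sgd_classifier.predict_proba(seen))    # "reasonable" probabilities
print(sgd_classifier.predict_proba(unseen))  # very different, seemingly random ones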
The same occurs if I apply TF-IDF only to the descriptions and keep the bank names as a separate categorical variable, as in the sketch below.
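By the separate-categorical variant I mean something like this (a minimal sketch assuming the original two-column arrays; handle_unknown='ignore' maps an unseen bank to an all-zero one-hot row, mirroring what TF-IDF does with unseen tokens):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack

desc_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4), min_df=10)
bank_encoder = OneHotEncoder(handle_unknown='ignore')  # unseen bank -> all-zero row

# TF-IDF on the description column, one-hot on the bank-name column, stacked side by side
X_train = hstack([desc_vectorizer.fit_transform(X_text_train[:, 0]),
                  bank_encoder.fit_transform(X_text_train[:, [1]])]).tocsr()
X_test = hstack([desc_vectorizer.transform(X_text_test[:, 0]),
                 bank_encoder.transform(X_text_test[:, [1]])]).tocsr()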
Why does this happen with SGDClassifier?
Is it that SGD in general cannot handle unseen values well because it converges in this stochastic way?
The interesting thing is that the TF-IDF vocabulary is fixed at fit time, so unseen values in the test set are essentially not taken into account in the features (i.e. all the respective features are just 0), yet SGD still breaks.
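That mechanism is easy to verify in isolation (a toy example with a word analyzer for readability; with char_wb the same applies per n-gram):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = TfidfVectorizer(analyzer='word')
toy.fit(['alpha beta', 'beta gamma'])   # vocabulary: alpha, beta, gamma
row = toy.transform(['delta epsilon'])  # no token is in the vocabulary
print(row.nnz)                          # 0 -> unseen tokens produce an all-zero row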
(I also posted this on scikit-learn's GitHub: https://github.com/scikit-learn/scikit-learn/issues/21906)