Scikit-Learn Random Forest Classifier: High accuracy on Training and Test, but not Production

Question

I am training a classifier that assigns text-based requests to departments. I have ~107,000 labeled examples across 22 imbalanced classes, with roughly the following distribution:

Class 1: 10,000
Class 2: 60,000
Class 3: 7,000
Class 4: 5,000
Class 5: 3,500
Classes 6 & 7: 2000 samples each
Classes 7-15: 1500 samples each
Classes 16-22: 500 samples each

I have been preprocessing the data so that every class has the same number of samples (I have tried anywhere from 5,000 to 50,000 samples per class). With the classifier below and the balanced training data, I am able to get up to 98.5% accuracy on the test data with a 50-50 split of the total training data. But when I load the classifier and score it on new requests as they come in, it only achieves 50-70% accuracy at best. The data is relatively stable, in that the same kinds of requests always go to the same department, so I am very surprised to be only 50-70% accurate, especially with such high accuracy on the test data:

import logging
import os
from collections import Counter

import joblib  # sklearn.externals.joblib was removed from later scikit-learn releases
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.utils import resample

logger = logging.getLogger(__name__)

def up_sample(data, labels, **kwargs):
    label_counts = Counter(labels)
    max_label = max(label_counts, key=label_counts.get)
    max_label_count = kwargs.get('samples', label_counts[max_label])
    output_text = []
    output_labels = []
    for label, count in label_counts.items():
        label_text = [data_row for data_row, label_row in zip(data, labels) if label_row == label]
        resampled_labels = [label] * max_label_count
        resampled_text = resample(label_text, n_samples=max_label_count, random_state=0)
        output_text = output_text + resampled_text
        output_labels = output_labels + resampled_labels
    return output_text, output_labels


clf = Pipeline(
    steps=[
        ('tfidf_vectorizer', TfidfVectorizer(stop_words='english')),
        ('clf', RandomForestClassifier(n_estimators=250, n_jobs=-1)),
    ]
)

resampled_data, resampled_labels = up_sample(data, labels) # UPDATE: produces ~700,000 samples, with many duplicates

label_encoder = LabelEncoder()
resampled_labels = label_encoder.fit_transform(resampled_labels)

X_train, X_test, y_train, y_test = train_test_split(resampled_data, resampled_labels, test_size=0.5, random_state=0) # UPDATE: many duplicates in both training and test data sets as a result of upsampling

clf.fit(X_train, y_train)

test_score = clf.score(X_test, y_test)
logger.debug('Test Score: %s', test_score) # 0.98-0.99

cross_validation_results = cross_val_score(clf, resampled_data, resampled_labels)
logger.debug('Cross Validation results: %r', cross_validation_results) # [98.7, 99.1, 97.8]

y_test_predicted = clf.predict(X_test)
output_classification_report = classification_report(y_test, y_test_predicted, target_names=label_encoder.classes_)
logger.debug(output_classification_report)  # 0.95-1.0 for precision and recall for all classes

clf_file_name = os.path.join(directory, clf_name)
joblib.dump(clf, clf_file_name)

label_encoder_file_name = os.path.join(directory, label_encoder_name)
joblib.dump(label_encoder, label_encoder_file_name)

# Later, in a different script
clf_file_name = os.path.join(directory, clf_name)
clf = joblib.load(clf_file_name)

label_encoder_file_name = os.path.join(directory, label_encoder_name)
label_encoder = joblib.load(label_encoder_file_name)

predictions = clf.predict(new_data)
# score() takes the raw inputs and the true labels (new_labels is assumed to be encoded with the same label_encoder)
logger.debug(clf.score(new_data, new_labels)) # 50-70%

Also, when I retrain the classifier with the new_data and predict on the new_data, it is 100% accurate. I know it will score much higher since it has already seen those examples. I have been reading about out-of-bag (OOB) error in random forests, which I understand could be related to my issue, but I am not familiar enough with OOB to know how to correct for it. I do not know how to proceed from here. How do I go about resolving this issue?
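
For reference, here is a minimal sketch of how the out-of-bag estimate can be read from a random forest in scikit-learn. The toy data from `make_classification` is an assumption standing in for the TF-IDF features; it is not the data from this question. The OOB score is computed for each row using only the trees whose bootstrap sample left that row out, so if the training set contains exact duplicates (as after upsampling), a copy of a left-out row can still sit in the bootstrap sample and the OOB score would likely be inflated in the same way as the test score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data (assumption): 500 rows, 3 classes.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# oob_score=True asks the forest to evaluate each row with the trees
# that did not see that row in their bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob_accuracy = forest.oob_score_  # accuracy estimated from out-of-bag rows
print("OOB accuracy estimate:", oob_accuracy)
```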

I have already read through the following questions/resources for resolving my issue, prior to posting my own question, but feel free to let me know if I overlooked something from them:

Have you processed the data and labels with the TfidfVectorizer outside of the pipeline? What's `self._clf`? Please post reproducible code (all your code in a single snippet). — Vivek Kumar, May 18 '18 at 05:53
`self._clf` was an overlooked typo from when I was separating my code from how the code was structured to work with other parts of the system for the purpose of this post. This has been corrected. Code is now a single snippet — DFenstermacher, May 18 '18 at 15:05
Getting 100% accuracy should be a red flag. That's not an indication of a great model, it's an indication of a horrible over-fit model. I would guess you have a 'data leak.' I bet there is a column in your training data that indicates what your output class should be. You may even be leaving the target variable as a column in X. Or it may be a more subtle leak. — Metropolis, May 18 '18 at 15:10
@Metropolis You were right: getting ~100% accuracy was not good. I did not find any leaks after checking `clf.feature_importances_`, but realized the leak was that many examples appeared in both the training and test datasets due to oversampling. So the classifier was not generalizing; it was simply being tested on data it had trained on. After downsampling (to keep the class representations balanced), I am now getting closer to 60%, which matches the results I was getting in production. So that was my data leak. Thank you for suggesting a leak, it was exactly what I needed to hear — DFenstermacher, May 18 '18 at 21:32
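
The leak described in this comment can be checked directly. The following is a rough sketch, assuming a hypothetical toy corpus of 20 distinct requests for one small class; it upsamples with replacement before splitting, as in the question, and then measures what fraction of the test rows also occur verbatim in the training rows.

```python
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy corpus: 20 distinct requests belonging to one small class.
texts = [f"request number {i}" for i in range(20)]

# Upsample with replacement to 200 rows BEFORE splitting (the problematic order).
upsampled = resample(texts, n_samples=200, replace=True, random_state=0)

train, test = train_test_split(upsampled, test_size=0.5, random_state=0)

# Fraction of test rows that have an exact copy in the training rows.
train_set = set(train)
leaked = [text for text in test if text in train_set]
leak_fraction = len(leaked) / len(test)
print("Fraction of test rows also present in training:", leak_fraction)
```

A fraction close to 1 means the test score mostly measures memorization of rows the model has already seen.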
First, resampling should only be done on the training data, not the whole data. Second, there is no need for label encoding; scikit-learn handles it automatically. Apply these two changes and then measure the accuracy. That will be an indication of your real accuracy — Vivek Kumar, May 19 '18 at 01:11
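
The two changes suggested in this comment could look roughly like the sketch below: split the original data first, upsample only the training portion, and pass string labels straight to the pipeline. The toy data, the department names, and the helper `upsample_training_only` are hypothetical and used only for illustration; the smaller `n_estimators` just keeps the example fast.

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.utils import resample


def upsample_training_only(texts, labels, random_state=0):
    """Upsample every class, with replacement, to the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = [], []
    for label in counts:
        class_texts = [t for t, l in zip(texts, labels) if l == label]
        out_texts += resample(
            class_texts, n_samples=target, replace=True, random_state=random_state
        )
        out_labels += [label] * target
    return out_texts, out_labels


# Hypothetical toy data: an imbalanced two-department corpus with no repeated texts.
data = [f"invoice payment query {i}" for i in range(40)]
data += [f"password reset issue {i}" for i in range(10)]
labels = ["billing"] * 40 + ["it"] * 10

# 1. Split the ORIGINAL data, so no test row can be a copy of a training row.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.5, stratify=labels, random_state=0
)

# 2. Balance only the training portion; the test set keeps the real distribution.
X_train_bal, y_train_bal = upsample_training_only(X_train, y_train)

# 3. Fit on string labels directly; no LabelEncoder is needed.
clf = Pipeline(
    steps=[
        ("tfidf_vectorizer", TfidfVectorizer(stop_words="english")),
        ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)
clf.fit(X_train_bal, y_train_bal)

test_score = clf.score(X_test, y_test)
print("Accuracy on untouched test split:", test_score)
```

Because the test split is never resampled, its score should be closer to what production traffic would show, provided production requests resemble the labeled data.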
