I am training a classifier that assigns text-based requests to departments. I have ~107,000 labeled examples across 22 imbalanced classes, with roughly the following distribution:
- Class 1: 10,000
- Class 2: 60,000
- Class 3: 7,000
- Class 4: 5,000
- Class 5: 3,500
- Classes 6 & 7: 2,000 samples each
- Classes 8-15: 1,500 samples each
- Classes 16-22: 500 samples each
I have been preprocessing the data to give every class the same number of samples (anywhere from 5,000 to 50,000 samples per class, depending on the run). With the classifier below and the balanced training data, I can get up to 98.5% accuracy on the test data using a 50-50 split of the total training data. But when I load the saved classifier and score it on new incoming requests, it only achieves 50-70% accuracy at best. The data is stable in the sense that the same requests always go to the same department, so I am very surprised to see only 50-70% accuracy in production, especially with such a high accuracy on the test data:
import logging
import os
from collections import Counter

import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn.utils import resample

logger = logging.getLogger(__name__)
def up_sample(data, labels, **kwargs):
    # Resample every class up to the size of the largest class
    # (or up to kwargs['samples'] if given), sampling with replacement.
    label_counts = Counter(labels)
    max_label = max(label_counts, key=label_counts.get)
    max_label_count = kwargs.get('samples', label_counts[max_label])
    output_text = []
    output_labels = []
    for label in label_counts:
        label_text = [data_row for data_row, label_row in zip(data, labels) if label_row == label]
        # Minority classes end up with many duplicated rows here.
        resampled_text = resample(label_text, n_samples=max_label_count, random_state=0)
        output_text.extend(resampled_text)
        output_labels.extend([label] * max_label_count)
    return output_text, output_labels
clf = Pipeline(
    steps=(('tfidf_vectorizer', TfidfVectorizer(stop_words='english')),
           ('clf', RandomForestClassifier(n_estimators=250, n_jobs=-1)))
)
label_encoder = LabelEncoder()
resampled_data, resampled_labels = up_sample(data, labels)  # UPDATE: produces ~700,000 samples, with many duplicates
resampled_labels = label_encoder.fit_transform(resampled_labels)
X_train, X_test, y_train, y_test = train_test_split(
    resampled_data, resampled_labels, test_size=0.5, random_state=0
)  # UPDATE: many duplicates end up in both the training and test sets as a result of upsampling
clf.fit(X_train, y_train)
test_score = clf.score(X_test, y_test)
logger.debug('Test Score: %s', test_score)  # 0.98-0.99
cross_validation_results = cross_val_score(clf, resampled_data, resampled_labels)
logger.debug('Cross Validation results: %r', cross_validation_results)  # [0.987, 0.991, 0.978]
y_test_predicted = clf.predict(X_test)
output_classification_report = classification_report(y_test, y_test_predicted, target_names=label_encoder.classes_)
logger.debug(output_classification_report) # 0.95-1.0 for precision and recall for all classes
clf_file_name = os.path.join(directory, clf_name)
joblib.dump(clf, clf_file_name)
label_encoder_file_name = os.path.join(directory, label_encoder_name)
joblib.dump(label_encoder, label_encoder_file_name)
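# Sanity check (my own addition): reload the pipeline immediately and re-score
# it on the same held-out fold; if this matches test_score above, then
# serialization is not the problem.
reloaded_clf = joblib.load(clf_file_name)
logger.debug('Reloaded test score: %s', reloaded_clf.score(X_test, y_test))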
# Later, in a different script
clf_file_name = os.path.join(directory, clf_name)
clf = joblib.load(clf_file_name)
label_encoder_file_name = os.path.join(directory, label_encoder_name)
label_encoder = joblib.load(label_encoder_file_name)
predictions = clf.predict(new_data)
logger.debug(clf.score(new_data, new_labels))  # 50-70%; score() expects (X, y), not (y_true, y_pred)
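Based on my UPDATE notes above, I suspect the duplicated rows produced by upsampling are leaking between the training and test sets and inflating the test score. A minimal sketch of what I am considering instead, reusing the same up_sample() and variables as above: split the original data first, then upsample only the training fold.
# Split the original (un-resampled) data first, stratifying on the labels,
# then upsample only the training fold so the test fold contains no
# duplicates of training rows.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.5, random_state=0, stratify=labels
)
train_text, train_labels = up_sample(X_train, y_train)

label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_labels)
y_test_encoded = label_encoder.transform(y_test)

clf.fit(train_text, train_labels)
logger.debug('Leak-free test score: %s', clf.score(X_test, y_test_encoded))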
Also, when I retrain the classifier on new_data and then predict on new_data, it is 100% accurate. I know it will score much higher since it has already seen those examples, but I have been reading about out-of-bag (OOB) error in random forests, which I understand could be related to my issue; however, I am not familiar enough with OOB to know how to correct for it. I do not know how to proceed from here. How do I go about resolving this issue?
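On the OOB point, my understanding is that scikit-learn's RandomForestClassifier can report out-of-bag accuracy directly when constructed with oob_score=True (each tree is then scored on the bootstrap rows it did not train on), exposing an oob_score_ attribute after fitting. A sketch of how I would read it from my pipeline, reusing the split-first variables from the sketch above:
# Turn on OOB scoring for the forest inside the existing pipeline.
clf.set_params(clf__oob_score=True)
clf.fit(train_text, train_labels)
logger.debug('OOB score: %s', clf.named_steps['clf'].oob_score_)
# Note: because train_text is upsampled, I suspect the OOB estimate may still
# be optimistic: a duplicate of an in-bag row can land out-of-bag for the same tree.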
I have already read through the following questions and resources while trying to resolve this, prior to posting my own question, but feel free to let me know if I overlooked something in them: