
I am attempting to undersample the majority class using Python scikit-learn. Currently my code looks up the N of the minority class and then undersamples exactly that many rows from the majority class, so both the test and training data end up with this 1:1 distribution. What I really want is to apply the 1:1 distribution to the training data ONLY, and test on the original distribution in the testing data.

I am not quite sure how to do the latter as there is some dict vectorization in between, which makes it confusing to me.

# Perform undersampling majority group
minorityN = len(df[df.ethnicity_scan == 1]) # get the total count of low-frequency group
minority_indices = df[df.ethnicity_scan == 1].index
minority_sample = df.loc[minority_indices]

majority_indices = df[df.ethnicity_scan == 0].index
random_indices = np.random.choice(majority_indices, minorityN, replace=False) # use the low-frequency group count to randomly sample from high-frequency group
majority_sample = df.loc[random_indices]

merged_sample = pd.concat([minority_sample, majority_sample], ignore_index=True) # merging all the low-frequency group sample and the new (randomly selected) high-frequency sample together
df = merged_sample
print 'Total N after undersampling:', len(df)

# Declaring variables
X = df.raw_f1.values
X2 = df.f2.values
X3 = df.f3.values
X4 = df.f4.values
y = df.outcome.values

# Codes skipped ....
def feature_noNeighborLoc(locString):
    pass
my_dict16 = [{'location': feature_noNeighborLoc(feature_full_name(i))} for i in X4]
# Codes skipped ....

# Dict vectorization
all_dict = []
for i in range(0, len(my_dict)):
    temp_dict = dict(
        my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
        + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
        + my_dict9[i].items() + my_dict10[i].items()
        + my_dict11[i].items() + my_dict12[i].items() + my_dict13[i].items() + my_dict14[i].items()
        + my_dict19[i].items()
        + my_dict16[i].items() # location feature
        )
    all_dict.append(temp_dict)  # append inside the loop so every row's merged dict is kept

from sklearn.feature_extraction import DictVectorizer  # dv was instantiated in the skipped code
dv = DictVectorizer()
newX = dv.fit_transform(all_dict)

X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

# Fitting X and y into model, using training data
classifierUsed2.fit(X_train, y_train)

# Making predictions with the trained model on training and test data
y_train_predictions = classifierUsed2.predict(X_train)
y_test_predictions = classifierUsed2.predict(X_test)
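
For reference, here is a minimal sketch of the "split first, then undersample only the training fold" idea being asked about. It assumes the same df with ethnicity_scan and outcome columns as above; the helper name undersample_training_only and the use of the newer sklearn.model_selection API are illustrative, not taken from the code above.

# Sketch: keep the original class ratio in the test set by splitting BEFORE undersampling,
# then drop majority-class rows from the training fold only.
import pandas as pd
from sklearn.model_selection import train_test_split  # newer home of train_test_split

def undersample_training_only(df, group_col='ethnicity_scan', test_size=0.2, seed=0):
    # 1) Split the raw rows first; stratify so the test fold mirrors the original ratio.
    train_df, test_df = train_test_split(df, test_size=test_size,
                                         stratify=df[group_col], random_state=seed)

    # 2) Undersample the majority class (group_col == 0) in the training fold only.
    minority = train_df[train_df[group_col] == 1]
    majority = train_df[train_df[group_col] == 0]
    majority_down = majority.sample(n=len(minority), random_state=seed)
    train_balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=seed)

    return train_balanced, test_df

The dict building and DictVectorizer step would then be run per fold: fit_transform on the training rows and only transform on the test rows, so the vectorizer never sees the test data.
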
  • Seems your indentation is wrong. Ad rem: not sure what's the problem to use the test data you want in the last line. – BartoszKP Jan 16 '16 at 20:44
  • Indentation corrected. The original distribution is, say, 20:1 majority:minority. My code makes both the testing and training data 1:1 majority:minority. I was advised by an ML consultant that the 1:1 ratio should apply to the training set only, while the testing set retains the original 20:1 ratio. – KubiK888 Jan 16 '16 at 20:54
  • You're just repeating what you say in your question. What's stopping you from using data with the 20:1 ratio in the testing phase? Btw. your for loop doesn't make any sense. – BartoszKP Jan 16 '16 at 21:26
  • The loop works just fine. I have not posted my entire code since it would be 600 lines, so I cut some of it; I showed that part to illustrate the dict vectorization process. I am not sure I understand your first question. I have done 20:1 test and training in another code set that doesn't include the under-sampling code. If you are asking what stops me from testing on 20:1 test data in the code above, the answer is that I no longer have it: df = merged_sample comes after under-sampling, and X_train, X_test, y_train, y_test are all derived from this df. – KubiK888 Jan 16 '16 at 21:37
  • I have added some sample code before the loop to illustrate the logic here. – KubiK888 Jan 16 '16 at 21:41
  • 1) Perhaps it "works", but it's pointless - `temp_dict` always has the value from the last iteration. 2) `y_test` AFAIR has a simple structure - IIRC it's quite easy to build your own `y_test` from whatever data you want. – BartoszKP Jan 16 '16 at 22:27

1 Answer


You want to subsample the training samples of one of your categories because you want a classifier that treats all the labels the same.

If that is the goal, then instead of subsampling you can change the value of the 'class_weight' parameter of your classifier to 'balanced' (or 'auto' for some classifiers), which does the job you want.

You can read the documentation of the LogisticRegression classifier as an example; note the description of the 'class_weight' parameter there.

By changing that parameter to 'balanced' you won't need to do the subsampling anymore.
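
For example, a minimal sketch of this suggestion, assuming LogisticRegression as in the linked documentation and that newX and y were built from the full (non-undersampled) data via the question's vectorization step; split size and random_state are arbitrary:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Keep the original class distribution in both folds; no undersampling.
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.2, random_state=0)

# 'balanced' reweights samples inversely proportional to class frequencies.
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)
y_test_predictions = clf.predict(X_test)
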

  • Should I expect the performance to be better or worse than without class_weight='balanced'? I ran it and the performance worsened, yet I thought undersampling generally should improve performance. – KubiK888 Jan 18 '16 at 01:37
  • My experience is that the accuracy worsens, because the imbalance also exists in your test set. This weighting is good for cases where getting a label wrong has very bad consequences. For example, in a cancer test you need to flag the suspicious cases even if some of them turn out not to have it. So I recommend not doing subsampling and setting class_weight to None; this is what I usually do. – Ash Jan 19 '16 at 00:20
  • So it is not uncommon for the accuracy to worsen AFTER using the adjusted class weight? I am not completely clear: did you recommend setting class_weight to "balanced" or "none"? Thanks. – KubiK888 Jan 19 '16 at 04:13
  • Yeah, the accuracy usually worsens AFTER using class_weight='balanced', so I suggest setting class_weight to 'none'. – Ash Jan 19 '16 at 06:39
  • I thought you suggested to USE the class weight in order to deal with imbalanced data WITHOUT undersampling; if I set class_weight="none", wouldn't that just be the default, which still doesn't take the imbalanced-data issue into consideration? – KubiK888 Jan 19 '16 at 07:44
  • Yes. If you set class_weight to 'none', it means you're ignoring the class imbalance. Whether you want to address the class imbalance depends on the application and your evaluation measure. For example, you can read about the micro and macro F1 measures to see how the choice of evaluation measure affects whether you want to address class imbalance; addressing it usually increases macro F1 but lowers micro F1. – Ash Jan 19 '16 at 08:01
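
As an illustration of that last point, a small sketch comparing micro and macro F1 on the (imbalanced) test predictions from the earlier snippet; the variable names follow that snippet and are otherwise arbitrary:

from sklearn.metrics import f1_score

micro_f1 = f1_score(y_test, y_test_predictions, average='micro')  # dominated by the majority class
macro_f1 = f1_score(y_test, y_test_predictions, average='macro')  # weights every class equally
print('micro F1: %.3f  macro F1: %.3f' % (micro_f1, macro_f1))

In single-label classification the micro-averaged F1 equals plain accuracy, which is why it can look fine even when the minority class is mostly missed.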