Stratified K Fold in Python

Question

I'm trying to perform stratified k-fold cross-validation in Python, and I read the following in the documentation:

I'm not exactly sure what this means. Could someone explain exactly when cross_val_score uses the StratifiedKFold strategy?

Chirag · Answer 1 · 2017-07-05T21:38:34.460

When you perform k-fold cross-validation, you split your training set into k folds, and each fold serves once as the validation set. StratifiedKFold ensures that each validation set contains approximately the same proportion of each label as the original training set.

For example, let's say you are training a classifier to distinguish spam from not spam. Your training set contains 50k samples, 10k of which are spam. If you perform 5-fold cross-validation, you split the training set into 5 validation sets of 10k samples each. With stratification, each validation set is selected so that it keeps the 4:1 ratio of not spam to spam, i.e. about 8k not-spam and 2k spam samples.
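As a rough illustration of the example above, the following sketch builds synthetic labels with the same 40k/10k split and counts the labels in each validation fold. The feature matrix `X` is a placeholder of zeros and the variable names are made up for this example; only the labels matter for how StratifiedKFold splits the data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels matching the example: 40k "not spam" (0) and 10k "spam" (1).
y = np.array([0] * 40_000 + [1] * 10_000)
# Placeholder features (an assumption for this sketch); the split depends on y only.
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5)

fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Count how many samples of each label landed in this validation fold.
    counts = np.bincount(y[val_idx], minlength=2)
    fold_counts.append((int(counts[0]), int(counts[1])))
    print(f"validation fold: not spam={counts[0]}, spam={counts[1]}")
```

Each validation fold should come out with the same 4:1 ratio as the full training set. A plain KFold without shuffling on this label-sorted data would instead put all the spam into the last fold.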

EDIT: I'm sorry, I misunderstood your original question. To expand on @unutbu's comments below, you want to confirm that the classifier you are using is a subclass of the base class ClassifierMixin, since that is the condition the documentation gives for cross_val_score to use StratifiedKFold. You can do so by inspecting the class's method resolution order (MRO).

Suppose you were using the classifier KNeighborsClassifier:

>>> from sklearn.neighbors import KNeighborsClassifier
>>> clf = KNeighborsClassifier()
>>> type(clf)
<class 'sklearn.neighbors.classification.KNeighborsClassifier'>
>>> type(clf).mro()
[<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, ..., <class 'sklearn.base.ClassifierMixin'>, <type 'object'>]

Notice that the second-to-last class in the resolution order is ClassifierMixin, so KNeighborsClassifier derives from it.
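The same check can be written with `isinstance`, as suggested in the comments below. The sketch here also compares `cross_val_score` with an integer `cv` against an explicitly passed `StratifiedKFold`. The iris dataset and the variable names are assumptions chosen for illustration, not part of the original question.

```python
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier()

# True when the estimator derives from ClassifierMixin.
derives = isinstance(clf, ClassifierMixin)
print("derives from ClassifierMixin:", derives)

# Integer cv with a classifier and class labels as y.
scores_int = cross_val_score(clf, X, y, cv=5)

# Passing the splitter explicitly, for comparison.
scores_explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

same = np.allclose(scores_int, scores_explicit)
print("same scores:", same)
```

If the two score arrays match, that is consistent with the integer `cv` having been resolved to a StratifiedKFold for this classifier. Passing the splitter explicitly is also the way to get stratification with shuffling or a fixed `random_state`.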

I understand that part, but how do I get it to perform StratifiedKFold? In the documentation it states that the StratifiedKFold strategy is used when the estimator derives from 'ClassifierMixin'. What exactly does that mean? — bugsyb, Jul 05 '17 at 21:20
@bugsyb: The estimator is the first argument passed to `cross_val_score`. In the example, `clf` is the estimator. "The estimator derives from ClassifierMixin" is True when `isinstance(clf, sklearn.base.ClassifierMixin)`. You can see all the bases (i.e. classes) from which `type(clf)` is derived by looking at `type(clf).mro()`. You'll see `ClassifierMixin` is the second-to-last class listed there. — unutbu, Jul 05 '17 at 21:24
For a little more on where the terminology "derived" comes from, see [the tutorial](https://docs.python.org/3.6/tutorial/classes.html#inheritance). For more on `mro` see [this SO question](https://stackoverflow.com/q/2010692/190597). — unutbu, Jul 05 '17 at 21:28
