I'm trying to perform stratified k-fold cross-validation in Python, and I read the following in the documentation:

[Image: excerpt from the documentation; not preserved]

I'm not exactly sure what this means. Could someone explain when exactly cross_val_score uses the StratifiedKFold strategy?

bugsyb

1 Answer

When you perform k-fold cross-validation, you split your training set into multiple validation sets. StratifiedKFold ensures that each validation set contains roughly the same proportion of each label as your original training set.

For example, let's say you are training a classifier to distinguish spam from not spam. Your training set contains 50k samples, 10k of which are spam. If you perform 5-fold cross-validation, you split your training set into 5 validation sets of 10k samples each. With stratification, each validation set is selected so that it maintains the 4:1 ratio of not spam to spam (8k not spam, 2k spam).
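
The spam example can be sketched at a smaller scale. This is only an illustration: the toy labels below (50 samples, 10 of them spam) and the names `X`, `y`, and `fold_counts` are made up for the sketch, and `shuffle`/`random_state` are set just to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 50 samples, 10 "spam" (1) and 40 "not spam" (0).
y = np.array([1] * 10 + [0] * 40)
X = np.zeros((len(y), 1))  # dummy features; only the labels drive the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    spam = int(y[val_idx].sum())
    not_spam = len(val_idx) - spam
    fold_counts.append((not_spam, spam))
    print(f"validation fold: {not_spam} not spam, {spam} spam")
```

Each of the 5 validation folds should contain 8 not-spam and 2 spam samples, which is the same 4:1 ratio as the full set.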

EDIT: Sorry, I misunderstood your original question. To expand on @unutbu's comments below: what you want to confirm is that the classifier you are using is a subclass of the base class ClassifierMixin. You can do so by inspecting its method resolution order (MRO).

Suppose you were using the classifier KNeighborsClassifier:

>>> from sklearn.neighbors import KNeighborsClassifier
>>> clf = KNeighborsClassifier()
>>> type(clf)
<class 'sklearn.neighbors.classification.KNeighborsClassifier'>
>>> type(clf).mro()
[<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, ..., <class 'sklearn.base.ClassifierMixin'>, <type 'object'>]

Notice that the second-to-last class in the resolution order shown above is ClassifierMixin, so this estimator derives from ClassifierMixin.
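
As a rough sketch of how this plays out in practice: the check can also be written with `isinstance`, and if you prefer not to rely on the default behaviour you can pass a StratifiedKFold object as `cv` yourself. The iris dataset and the variable names here are only assumptions for illustration, and the imports assume a scikit-learn version that provides `sklearn.model_selection`.

```python
from sklearn.base import ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()

# "The estimator derives from ClassifierMixin" means this check passes.
derives = isinstance(clf, ClassifierMixin)
print(derives)

# Example data chosen only for illustration (a multiclass target).
X, y = load_iris(return_X_y=True)

# Integer cv with a classifier and a multiclass target:
# the documentation says StratifiedKFold is used in this case.
scores_default = cross_val_score(clf, X, y, cv=5)

# Requesting the same strategy explicitly.
scores_explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

print(scores_default)
print(scores_explicit)
```

Passing the splitter explicitly makes the intent visible in the code and lets you control options such as `shuffle`, rather than depending on what `cross_val_score` infers from the estimator.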

Chirag
  • I understand that part, but how do I get it to perform StratifiedKFold? In the documentation it states that the StratifiedKFold strategy is used when the estimator derives from 'ClassifierMixin'. What exactly does that mean? – bugsyb Jul 05 '17 at 21:20
  • @bugsyb: The estimator is the first argument passed to `cross_val_score`. In the example, `clf` is the estimator. "The estimator derives from ClassifierMixin" is true when `isinstance(clf, sklearn.base.ClassifierMixin)` returns `True`. You can see all the bases (i.e. classes) from which `type(clf)` is derived by looking at `type(clf).mro()`. You'll see `ClassifierMixin` is the second-to-last class listed there. – unutbu Jul 05 '17 at 21:24
  • For a little more on where the term "derived" comes from, see [the tutorial](https://docs.python.org/3.6/tutorial/classes.html#inheritance). For more on `mro`, see [this SO question](https://stackoverflow.com/q/2010692/190597). – unutbu Jul 05 '17 at 21:28