6

Is there any built-in way to get scikit-learn to perform shuffled stratified k-fold cross-validation? This is one of the most common CV methods, and I am surprised I couldn't find a built-in method to do this.

I saw that cross_validation.KFold() has a shuffling flag, but it is not stratified. Unfortunately cross_validation.StratifiedKFold() does not have such an option, and cross_validation.StratifiedShuffleSplit() does not produce disjoint folds.

Am I missing something? Is this planned?

(obviously I can implement this by myself)

Aletheios
Bitwise

4 Answers

5

A shuffle flag for cross_validation.StratifiedKFold was introduced in version 0.15 (the current release at the time of writing):

http://scikit-learn.org/0.15/modules/generated/sklearn.cross_validation.StratifiedKFold.html

This can be found in the Changelog:

http://scikit-learn.org/stable/whats_new.html#new-features

Shuffle option for cross_validation.StratifiedKFold. By Jeffrey Blackburne.
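For illustration, here is a minimal sketch of how the shuffle option can be used. It assumes a newer scikit-learn release (0.18 or later), where the class lives in sklearn.model_selection and takes n_splits plus a split(X, y) call, rather than the cross_validation module linked above. The toy labels and the placeholder X are made up for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold  # assumes scikit-learn >= 0.18

# Toy labels (illustrative only): three classes of sizes 10, 20 and 10.
y = np.array([0] * 10 + [1] * 20 + [3] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

# shuffle=True shuffles the samples within each class before they are assigned to folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in skf.split(X, y)]

# The test folds are disjoint and together cover every sample exactly once.
all_test = np.concatenate(folds)
print(len(all_test), len(np.unique(all_test)))
```

Passing a fixed random_state makes the shuffled folds reproducible between runs.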

Mutabor
2

I thought I would post my solution in case it is useful to anyone else.

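from collections import defaultdict
import random
import numpy as np

def strat_map(y):
    """
    Returns a permutation of the indices of y that only swaps
    samples within the same class, so y[strat_map(y)] == y.
    """
    # collect the indices of each class
    smap = defaultdict(list)
    for i, v in enumerate(y):
        smap[v].append(i)
    # shuffle the indices within each class
    for values in smap.values():
        random.shuffle(values)
    # integer dtype so the result can be used for indexing even if y holds floats or strings
    y_map = np.empty(len(y), dtype=int)
    for i, v in enumerate(y):
        y_map[i] = smap[v].pop()
    return y_map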

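##########
#Example Use
##########
# old (pre-0.18) API: the labels are passed to the constructor
from sklearn.cross_validation import StratifiedKFold
skf = StratifiedKFold(y, nfolds)  # y and nfolds are defined by the caller
sm = strat_map(y)
# each iteration yields (train, test) index arrays
for train, test in skf:
    train, test = sm[train], sm[test]
    #then cv as usual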


#######
#tests#
#######
import numpy as np
for _ in range(100):
    y = np.array([0] * 10 + [1] * 20 + [3] * 10)
    np.random.shuffle(y)
    sm = strat_map(y)
    shuffled = y[sm]
    assert (sm != np.arange(len(y))).any(), "did not shuffle"
    assert (shuffled == y).all(), "classes not in right position"
    assert set(sm) == set(range(len(y))), "missing indices"


for _ in range(100):
    nfolds = 10
    skf = StratifiedKFold(y, nfolds)
    sm = strat_map(y)
    # each iteration yields (train, test) index arrays
    for train, test in skf:
        assert (sm[test] != test).any(), "did not shuffle"
        assert (y[sm[test]] == y[test]).all(), "classes not in right position"
John C Earls
1

Here is my implementation of a stratified shuffled split into a training set and a test set:

import numpy as np

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates boolean index masks for a random stratified split into training and test sets.
    The training set gets train_proportion of each class and the test set gets the rest.
    y is any iterable giving the class of each observation in the sample.
    The class proportions of the initial sample are preserved in both the
    training and test sets (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds


y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]
Apogentus
-3

As far as I know, this is actually implemented in scikit-learn.

""" Stratified ShuffleSplit cross validation iterator

Provides train/test indices to split data in train test sets.

This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets. """

rd108
  • As I wrote in my question, StratifiedShuffleSplit() does not do a shuffled version of StratifiedKFold(), i.e. shuffling prior to the StratifiedKFold(). This is even mentioned in the last sentence of your answer. KFold CV requires that there is no intersection between folds and that their union is the whole dataset. – Bitwise Jun 06 '13 at 16:05
  • Ah, yes, the folds aren't guaranteed to be disjoint. Sorry for not reading to the end of your question. – rd108 Jun 06 '13 at 21:36
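
The k-fold requirement from the comments (no intersection between folds, and their union is the whole dataset) can be checked directly. Below is a rough standard-library sketch, not scikit-learn's implementation: the helper name shuffled_stratified_folds and the toy labels are hypothetical. It shuffles the indices within each class and deals them round-robin into folds, then verifies both properties.

```python
import random
from collections import defaultdict

def shuffled_stratified_folds(y, n_folds, seed=None):
    """Hypothetical helper: returns n_folds disjoint lists of indices,
    each with roughly the same class proportions as y."""
    rng = random.Random(seed)
    # group sample indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    offset = 0  # carried across classes so leftover samples spread over the folds
    for idx in by_class.values():
        rng.shuffle(idx)  # shuffle within the class only
        for j, i in enumerate(idx):
            folds[(offset + j) % n_folds].append(i)
        offset += len(idx)
    return folds

y = [0] * 10 + [1] * 20 + [3] * 10
folds = shuffled_stratified_folds(y, n_folds=5, seed=0)

seen = [i for fold in folds for i in fold]
assert len(seen) == len(set(seen)), "folds overlap"
assert set(seen) == set(range(len(y))), "folds do not cover the dataset"
```

Repeated random splits in the style of StratifiedShuffleSplit would generally fail the first assertion, because each split draws its test set independently of the others.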