
This is a question about scikit-learn (version 0.17.0) in Python 2.7, together with pandas 0.17.1. When I split raw data (with no missing entries) using the approach detailed here, and then try to proceed with a .fit() on the split data, an error appears.

Here is the code, taken largely unchanged from the other Stack Overflow question, with variables renamed. I then instantiate a grid and try to fit the split data, with the aim of determining the optimal classifier parameters. The error occurs at the last line of the code below:

import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

# separate target variable from dataset
y = wine['quality']
X = wine.drop(['quality', 'color'], axis=1)

# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(y, n_iter=3, test_size=0.2)

# Split dataset to obtain indices for train and test set
for train_index, test_index in sss:
    xtrain, xtest = X.iloc[train_index], X.iloc[test_index]
    ytrain, ytest = y[train_index], y[test_index]

# Pick some classifier here
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()

from sklearn.grid_search import GridSearchCV
# Instantiate grid
grid = GridSearchCV(decision_tree, param_grid={'max_depth':np.arange(1,3)}, cv=sss, scoring='accuracy')

# this line causes the error message
grid.fit(xtrain,ytrain)

Here is the error message produced by the above code:

Traceback (most recent call last):
  File "C:\Python27\test.py", line 23, in <module>
    grid.fit(xtrain,ytrain)
  File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
    for parameters in parameter_iterable
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
    self.results = batch()
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1524, in _fit_and_score
    X_train, y_train = _safe_split(estimator, X, y, train)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1591, in _safe_split
    X_subset = safe_indexing(X, indices)
  File "C:\Python27\lib\site-packages\sklearn\utils\__init__.py", line 152, in safe_indexing
    return X.iloc[indices]
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1227, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1504, in _getitem_axis
    self._is_valid_list_like(key, axis)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1443, in _is_valid_list_like
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds

NOTE: It is important to me to keep X and y as pandas data structures, similar to the second approach presented in the other Stack Overflow question above, i.e. I would not want to use X.values and y.values.

Question: Keeping the raw data as pandas data structures (a DataFrame for X and a Series for y), is there a way to run grid.fit() without getting this error message?

  • One issue in this script is that the CV object sss produces indices for all of the rows in y. When you call grid.fit you give it only xtrain, ytrain, which are shorter than y, so the positional indexers go out of bounds. Once you create sss you do not need to split the dataframes yourself: pass the whole X and y into grid.fit and it does the splitting according to the indices from sss. – Keith Brodie Mar 14 '16 at 21:29
  • That's right, but I am also giving it `y_test`. The size of `X_test` matches that of `y_test`. Shouldn't this mean that position indexers match? – edesz Mar 15 '16 at 00:26
  • @WR, no, you are giving `y` to the `StratifiedShuffleSplit` and `xtrain, ytrain` to `grid.fit`. This is the root of the problem. – Igor Raush Mar 15 '16 at 00:32
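
To make the mismatch described in these comments concrete, a quick check (continuing from the variables defined in the question) shows that the indices produced by sss range over the full dataset while xtrain holds only part of it:

print(len(X))        # number of rows in the full dataset
print(len(xtrain))   # only ~80% of the rows after the split
for train_index, test_index in sss:
    # these are positions into the full X/y, so they can exceed len(xtrain) - 1
    print(train_index.max())
    print(test_index.max())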

1 Answer


You should pass X and y directly to fit(), like

grid.fit(X, y)

and GridSearchCV will take care of

xtrain, xtest = X.iloc[train_index], X.iloc[test_index]
ytrain, ytest = y[train_index], y[test_index]

The StratifiedShuffleSplit instance, when iterated over, yields pairs of train/test split indices:

>>> list(sss)
[(array([2531, 4996, 4998, ..., 3205, 2717, 4983]), array([5942,  893, 1702, ..., 6340, 4806, 2537])),
 (array([1888, 2332, 6276, ..., 1674,  775, 3705]), array([3404, 3304, 4741, ..., 4397, 3646, 1410])),
 (array([1517, 3759, 4402, ..., 5098, 4619, 4521]), array([1110, 4076, 1280, ..., 6384, 1294, 1132]))]

GridSearchCV will use these indices to split the training samples. There is no need for you to do it manually.

The error occurs because you are feeding xtrain and ytrain (one of the train/test splits) into grid.fit, while the cross-validator sss was built from the full y. Its indices refer to rows that exist in the full dataset but not in the shorter split, so the positional lookup raises an IndexError.
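
Putting this together, a minimal sketch of the corrected script (same data, classifier and parameter grid as in the question; X and y stay a pandas DataFrame and Series):

import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
y = wine['quality']
X = wine.drop(['quality', 'color'], axis=1)

# the CV object is built on the same y that is later passed to fit()
sss = StratifiedShuffleSplit(y, n_iter=3, test_size=0.2)

grid = GridSearchCV(DecisionTreeClassifier(),
                    param_grid={'max_depth': np.arange(1, 3)},
                    cv=sss, scoring='accuracy')

# pass the full DataFrame/Series; GridSearchCV splits internally using sss
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)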

  • But then how does one create a hold out set? Without splitting, there is no way to get `x_test`, `y_test`. Or is there another method to create a hold out dataset for `X` and `y`? – edesz Mar 15 '16 at 00:24
  • Cross validation implies iterating through a number of train/test splits (in your case, 3), and averaging the score. `GridSearchCV` automatically does this. Why do you need another hold-out set? – Igor Raush Mar 15 '16 at 00:27
  • If you really want to hold out a chunk of data, you can use something like `train_test_split`, which just produces a single split of the data (rather than an iterable of splits). In the end, you need to make sure to instantiate `StratifiedShuffleSplit` with the `y` which comes from the `X, y` tuple which you are passing to `fit(...)`. – Igor Raush Mar 15 '16 at 00:30
  • Okay, the `train_test_split` part is making sense to me - use `train_test_split()` to create a testing (i.e. hold out) set and use the training set in `grid.fit()` as `grid.fit(X_train,y_train)`. Does that follow the correct rationale for using a hold out set? – edesz Mar 15 '16 at 00:46
  • What you described will work if you **also** do `StratifiedShuffleSplit(y_train, ...)`. My question is, what do you plan to do with `X_test, y_test`? – Igor Raush Mar 15 '16 at 01:08
  • Wouldn't you train the classifier with `X_train`, `y_train` and then evaluate the classifier using `X_test`, `y_test`? – edesz Mar 15 '16 at 01:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/106301/discussion-between-igor-raush-and-w-r). – Igor Raush Mar 15 '16 at 01:10
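
Following the comment thread, here is a sketch of the hold-out variant (names such as X_train, X_test and sss_train are illustrative): carve off a test set once with train_test_split, build StratifiedShuffleSplit on y_train, tune on the training portion only, then evaluate the chosen model on the untouched hold-out set.

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit, train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 1) single split to create the hold-out set (X and y as defined in the question)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                     random_state=0)

# 2) cross-validation indices built on y_train -- the same y later passed to fit()
sss_train = StratifiedShuffleSplit(y_train, n_iter=3, test_size=0.2)

grid = GridSearchCV(DecisionTreeClassifier(),
                    param_grid={'max_depth': np.arange(1, 3)},
                    cv=sss_train, scoring='accuracy')

# 3) tune on the training portion only
grid.fit(X_train, y_train)

# 4) evaluate the tuned classifier once on the hold-out set
print(grid.best_params_)
print(grid.score(X_test, y_test))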