I have a set of documents and a set of labels. Right now, I am using train_test_split to split my dataset in a 90:10 ratio. However, I would like to use K-fold cross-validation instead.

train = []

with open("/Users/rte/Documents/Documents.txt") as f:
    for line in f:
        train.append(line.strip().split())

labels = []
with open("/Users/rte/Documents/Labels.txt") as t:
    for line in t:
        labels.append(line.strip().split())

X_train, X_test, Y_train, Y_test = train_test_split(train, labels, test_size=0.1, random_state=42)

When I try the method shown in the scikit-learn documentation, I receive an error:

kf=KFold(len(train), n_folds=3)

for train_index, test_index in kf:
     X_train, X_test = train[train_index],train[test_index]
     y_train, y_test = labels[train_index],labels[test_index]

Error:

   X_train, X_test = train[train_index],train[test_index]
TypeError: only integer arrays with one element can be converted to an index

How can I perform 10-fold cross-validation on my documents and labels?

minks
  • What have you tried so far to get the Kfold cross-validation working? Have you seen the example on the [documentation page](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold)? – Robin Spiess Feb 03 '16 at 12:10
  • Yes I have tried the example given there on my documents and label set, but I receive an error: *X_train, X_test = train[train_index],train[test_index] TypeError: only integer arrays with one element can be converted to an index* – minks Feb 03 '16 at 14:47

1 Answer

There are two ways to solve this error:

First way:

Convert your data to NumPy arrays, which (unlike plain Python lists) can be indexed with an array of indices:

import numpy as np
[...]
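# dtype=object keeps each document as one element, even when the
# token lists have different lengths (recent NumPy versions reject
# ragged input without it)
train = np.array(train, dtype=object)
labels = np.array(labels, dtype=object)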

Your current code should then work.
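
To illustrate why the conversion helps, here is a minimal sketch with made-up toy documents (the names `docs`, `idx`, and `docs_arr` are illustrative and not from the question). A plain list accepts only an integer or a slice as an index, while a NumPy array accepts an array of integers like the ones KFold yields:

```python
import numpy as np

# Toy stand-ins for the tokenised documents (hypothetical data).
docs = [["a", "b"], ["c"], ["d", "e", "f"]]
idx = np.array([0, 2])  # the kind of index array KFold yields

# Indexing a plain Python list with an index array raises TypeError.
try:
    docs[idx]
    list_indexing_ok = True
except TypeError:
    list_indexing_ok = False

# dtype=object keeps each document as one element even though the
# token lists have different lengths.
docs_arr = np.array(docs, dtype=object)
selected = docs_arr[idx]  # picks the documents at positions 0 and 2
```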

Second way:

Keep train and labels as plain lists and use list comprehensions to pick out the elements at the positions given by train_index and test_index:

for train_index, test_index in kf:
    X_train, X_test = [train[i] for i in train_index],[train[j] for j in test_index]
    y_train, y_test = [labels[i] for i in train_index],[labels[j] for j in test_index]

(For this solution, see also the related question "index list with another list".)
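
For the 10 folds asked about in the question, the number of folds is set on the KFold object itself (`n_folds=10` with the constructor used above). In newer scikit-learn releases, KFold lives in `sklearn.model_selection`, takes `n_splits` instead, and produces the indices through a `split()` method. The following is only a sketch under that assumption, using made-up toy data in place of the question's files and the list-comprehension approach for indexing:

```python
from sklearn.model_selection import KFold

# Toy stand-ins for the question's `train` and `labels` lists (hypothetical data).
train = [["doc", str(i)] for i in range(20)]
labels = [[str(i % 2)] for i in range(20)]

# shuffle and random_state are optional; they are set here only so the
# folds are randomised but reproducible.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []
for train_index, test_index in kf.split(train):
    # Index the plain lists element by element.
    X_train = [train[i] for i in train_index]
    X_test = [train[i] for i in test_index]
    y_train = [labels[i] for i in train_index]
    y_test = [labels[i] for i in test_index]
    # Fit and evaluate a model on this fold here.
    fold_sizes.append((len(X_train), len(X_test)))
```

Each of the 10 iterations holds out a different tenth of the data for testing and trains on the rest.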

Robin Spiess