Questions tagged [cross-validation]

Cross-Validation is a method of evaluating and comparing predictive systems in statistics and machine learning.

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.

In typical cross-validation, the training and validation sets rotate across successive rounds so that every data point gets a chance to be validated. The basic form of cross-validation is k-fold cross-validation.

Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
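For illustration, a minimal k-fold sketch on toy data, using scikit-learn's KFold (the library most questions under this tag concern):

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
    y = np.arange(10)

    # 5-fold CV: each round trains on 4 folds and validates on the held-out
    # fold, so every sample is validated exactly once across the 5 rounds.
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        print("validation indices:", val_idx)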

2604 questions
78 votes · 3 answers

difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

As the title says, I am wondering what the difference is between StratifiedKFold with the parameter shuffle=True, StratifiedKFold(n_splits=10, shuffle=True, random_state=0), and StratifiedShuffleSplit, StratifiedShuffleSplit(n_splits=10,…
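A rough sketch of the distinction on toy data: StratifiedKFold partitions the data into disjoint folds, while StratifiedShuffleSplit draws independent random splits that may overlap.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

    X = np.zeros((12, 2))               # feature values are irrelevant to the split
    y = np.array([0] * 6 + [1] * 6)     # balanced binary labels

    # StratifiedKFold: disjoint folds; every sample lands in a test set exactly once.
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for _, test_idx in skf.split(X, y):
        print("KFold test:", sorted(test_idx))

    # StratifiedShuffleSplit: independent random splits; test sets may overlap
    # across rounds, and some samples may never appear in any test set.
    sss = StratifiedShuffleSplit(n_splits=3, test_size=4, random_state=0)
    for _, test_idx in sss.split(X, y):
        print("ShuffleSplit test:", sorted(test_idx))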
68 votes · 3 answers

scikit-learn cross validation, negative values with mean squared error

When I use the following code with a data matrix X of size (952, 144) and an output vector y of size (952), the mean_squared_error metric returns negative values, which is unexpected. Do you have any idea why? from sklearn.svm import SVR from sklearn import…
asked by ahmethungari
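The short answer: scikit-learn scorers follow a greater-is-better convention, so the MSE scorer is negated. A minimal sketch on synthetic data:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVR

    X, y = make_regression(n_samples=200, n_features=10, random_state=0)

    # Scorers must satisfy "greater is better", so the MSE scorer returns
    # -MSE; the negative values are expected, not a bug.
    scores = cross_val_score(SVR(), X, y, scoring="neg_mean_squared_error", cv=5)
    mse = -scores          # flip the sign to recover the usual positive MSE
    print(mse.mean())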
58 votes · 2 answers

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

I'm running GridSearchCV to optimize the parameters of a classifier in scikit-learn. Once I'm done, I'd like to know which parameters were chosen as the best. Whenever I try, I get an AttributeError: 'RandomForestClassifier' object has no attribute…
asked by sapo_cosmico
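The chosen parameters live on the fitted search object, not on the classifier that was passed in, which is why the attribute lookup fails. A minimal sketch with a hypothetical grid:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=300, random_state=0)

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
        cv=3,
    )
    grid.fit(X, y)

    print(grid.best_params_)       # the winning parameter combination
    print(grid.best_estimator_)    # a classifier refit on all data with them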
57 votes · 6 answers

Using explicit (predefined) validation set for grid search with sklearn

I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms. I would now like to optimize the parameters of my SVM…
asked by pir
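One common approach is PredefinedSplit, which tells GridSearchCV exactly which samples form the single validation fold. A sketch with stand-in arrays for the asker's fixed sets:

    import numpy as np
    from sklearn.model_selection import GridSearchCV, PredefinedSplit
    from sklearn.svm import SVC

    # Stand-ins for the pre-split train and validation sets.
    X_train, y_train = np.random.rand(80, 5), np.random.randint(0, 2, 80)
    X_val, y_val = np.random.rand(20, 5), np.random.randint(0, 2, 20)

    # -1 marks samples that are always in training; 0 marks the validation fold.
    X = np.concatenate([X_train, X_val])
    y = np.concatenate([y_train, y_val])
    test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val))])

    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=PredefinedSplit(test_fold))
    grid.fit(X, y)
    print(grid.best_params_)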
57 votes · 6 answers

What is the difference between cross-validation and grid search?

In simple words, what is the difference between cross-validation and grid search? How does grid search work? Should I first do cross-validation and then a grid search?
asked by Linda
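In short: cross-validation scores one fixed model, while grid search tries many parameter combinations and uses cross-validation internally to score each one. A sketch:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Cross-validation alone: estimate how well one fixed model generalises.
    print(cross_val_score(SVC(C=1.0), X, y, cv=5).mean())

    # Grid search: try every combination, scoring each with cross-validation,
    # so the two are combined rather than done one after the other.
    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)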
48 votes · 6 answers

module 'sklearn' has no attribute 'cross_validation'

I am trying to split my dataset into training and testing datasets, but I am getting this error: X_train,X_test,Y_train,Y_test = sklearn.cross_validation.train_test_split(X,df1['ENTRIESn_hourly']) AttributeError Traceback…
asked by Naren
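The sklearn.cross_validation module was deprecated in 0.18 and removed in 0.20; the same helpers now live in sklearn.model_selection. A sketch with stand-ins for the asker's X and df1:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split  # new home since 0.18

    # Stand-ins for the asker's X and df1['ENTRIESn_hourly'].
    X = np.random.rand(100, 3)
    y = pd.Series(np.random.rand(100), name="ENTRIESn_hourly")

    X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)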
46 votes · 6 answers

How to split data into a balanced training set and test set in sklearn

I am using sklearn for a multi-classification task. I need to split all the data into a train_set and a test_set. I want to randomly take the same number of samples from each class. Currently, I am using this function: X_train, X_test, y_train, y_test =…
asked by Jeanne
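Assuming the classes are roughly balanced to begin with, passing stratify=y to train_test_split keeps the per-class proportions identical in both sets:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(90, 4)
    y = np.repeat([0, 1, 2], 30)    # three balanced classes

    # stratify=y preserves the class proportions in train and test alike.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    print(np.bincount(y_train), np.bincount(y_test))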
43 votes · 6 answers

Sklearn StratifiedKFold: ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead

I am working with sklearn's stratified k-fold split, and when I attempt to split using multi-class targets, I receive an error (see below). When I split using binary targets, it works with no problem. num_classes = len(np.unique(y_train)) y_train_categorical =…
asked by jKraut
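StratifiedKFold rejects one-hot ("multilabel-indicator") targets; a common workaround is to hand split() the integer class labels while keeping the one-hot matrix for training:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.random.rand(30, 4)
    y_onehot = np.eye(3)[np.repeat([0, 1, 2], 10)]   # one-hot targets

    # split() on the integer labels; index the one-hot matrix afterwards.
    y_labels = np.argmax(y_onehot, axis=1)
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y_labels):
        X_tr, y_tr = X[train_idx], y_onehot[train_idx]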
41 votes · 4 answers

return coefficients from Pipeline object in sklearn

I've fit a Pipeline object with RandomizedSearchCV pipe_sgd = Pipeline([('scl', StandardScaler()), ('clf', SGDClassifier(n_jobs=-1))]) param_dist_sgd = {'clf__loss': ['log'], 'clf__penalty': [None, 'l1', 'l2',…
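A Pipeline has no coef_ of its own; the fitted step has to be reached by name. A sketch mirroring the asker's scaler-plus-SGD setup:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=200, random_state=0)

    pipe = Pipeline([("scl", StandardScaler()), ("clf", SGDClassifier())])
    pipe.fit(X, y)

    # Reach into the fitted pipeline step by its name.
    print(pipe.named_steps["clf"].coef_)

    # After a search, the fitted pipeline is search.best_estimator_:
    # search.best_estimator_.named_steps["clf"].coef_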
38 votes · 8 answers

How to extract model hyper-parameters from spark.ml in PySpark?

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected: from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import…
asked by Paul
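A sketch of one way to read the selected parameters back, assuming a toy DataFrame: cvModel.bestModel is the winning model, and extractParamMap() lists its effective parameter settings.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame(
        [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)] * 10,
        ["features", "label"])

    lr = LogisticRegression()
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(), numFolds=3)
    cvModel = cv.fit(train)

    # The winning model; its effective settings can be read back.
    for param, value in cvModel.bestModel.extractParamMap().items():
        print(param.name, "=", value)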
37 votes · 4 answers

Using statsmodel estimations with scikit-learn cross validation, is it possible?

I posted this question on the Cross Validated forum and later realized it might find a more appropriate audience on Stack Overflow instead. I am looking for a way to use the fit object (result) obtained from Python statsmodels to feed into…
asked by CARTman
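One workable pattern is a thin adapter that gives the statsmodels model the fit/predict interface scikit-learn's CV utilities expect; SMWrapper below is a hypothetical name, sketched for OLS:

    import numpy as np
    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.model_selection import cross_val_score

    class SMWrapper(BaseEstimator, RegressorMixin):
        """Adapter exposing statsmodels OLS through sklearn's estimator API."""

        def fit(self, X, y):
            self.result_ = sm.OLS(y, sm.add_constant(X)).fit()
            return self

        def predict(self, X):
            return self.result_.predict(sm.add_constant(X))

    X = np.random.rand(100, 3)
    y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * np.random.randn(100)
    print(cross_val_score(SMWrapper(), X, y, cv=5, scoring="r2"))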
36 votes · 2 answers

What is OOF approach in machine learning?

I have seen people in many Kaggle notebooks talk about the OOF approach when they do machine learning with k-fold validation. What is OOF, and is it related to k-fold validation? Also, can you suggest some useful resources to get the concept in…
asked by Nikhil Mishra
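OOF stands for "out-of-fold": under k-fold CV, each sample gets a prediction from the one fold-model that did not train on it. In scikit-learn, cross_val_predict produces exactly these predictions:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_predict

    X, y = make_classification(n_samples=300, random_state=0)

    # Each sample is predicted by the fold-model that never saw it, so the
    # predictions are honest; they are often reused as stacking features.
    oof_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(accuracy_score(y, oof_pred))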
36 votes · 3 answers

Difference between cross_val_score and cross_val_predict

I want to evaluate a regression model built with scikit-learn using cross-validation, and I am getting confused about which of the two functions, cross_val_score and cross_val_predict, I should use. One option would be: cvs = DecisionTreeRegressor(max_depth =…
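Roughly: cross_val_score returns one score per fold (use it to summarise performance), while cross_val_predict returns one out-of-fold prediction per sample (use it to inspect residuals or plot predicted vs. actual). A sketch:

    from sklearn.datasets import make_regression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import cross_val_predict, cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=200, noise=10, random_state=0)
    model = DecisionTreeRegressor(max_depth=4, random_state=0)

    # One score per fold:
    print(cross_val_score(model, X, y, cv=5, scoring="r2"))

    # One out-of-fold prediction per sample; a single score computed from
    # the pooled predictions need not equal the mean of the fold scores.
    pred = cross_val_predict(model, X, y, cv=5)
    print(r2_score(y, pred))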
35 votes · 4 answers

How is scikit-learn cross_val_predict accuracy score calculated?

Does cross_val_predict (see doc, v0.18) with the k-fold method, as shown in the code below, calculate accuracy for each fold and average them at the end, or not? cv = KFold(len(labels), n_folds=20) clf = SVC() ypred = cross_val_predict(clf, td, labels,…
asked by Roman
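It does not average per-fold accuracies: cross_val_predict only returns pooled out-of-fold predictions, and any score computed from them weights every sample equally. With unequal fold sizes the two quantities can differ, as this sketch shows:

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=103, random_state=0)  # uneven folds
    cv = KFold(n_splits=5)

    # Mean of per-fold accuracies (what cross_val_score reports):
    print(cross_val_score(SVC(), X, y, cv=cv).mean())

    # Accuracy over the pooled out-of-fold predictions:
    print(accuracy_score(y, cross_val_predict(SVC(), X, y, cv=cv)))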
33 votes · 3 answers

Early stopping with Keras and sklearn GridSearchCV cross-validation

I wish to implement early stopping with Keras and sklearn's GridSearchCV. The working code example below is modified from How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras. The data set may be downloaded from here. The…
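A sketch of the usual pattern, assuming the legacy keras.wrappers.scikit_learn wrapper (newer stacks use the scikeras package instead): keyword arguments passed to GridSearchCV.fit are forwarded to the underlying Keras fit, so an EarlyStopping callback applies inside every CV fold.

    from keras.callbacks import EarlyStopping
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)

    def build_model(units=16):
        model = Sequential([Dense(units, activation="relu", input_shape=(8,)),
                            Dense(1, activation="sigmoid")])
        model.compile(loss="binary_crossentropy", optimizer="adam")
        return model

    clf = KerasClassifier(build_fn=build_model, epochs=100, verbose=0)
    grid = GridSearchCV(clf, {"units": [8, 16]}, cv=3)

    # validation_split carves the monitored set out of each training fold.
    grid.fit(X, y,
             callbacks=[EarlyStopping(monitor="val_loss", patience=3)],
             validation_split=0.2)
    print(grid.best_params_)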