
I am trying to use supervised machine learning to predict the weight of crops (e.g. potatoes) from their length and width measurements. Before fitting a specific model (e.g. linear regression), I want to perform stratified sampling based on the frequency of each crop variety in my data set. For example, if I split my data into 5 partitions (i.e. I use cross validation) and variety1 accounts for 50% of my observations, then 50% of the observations in each partitioned training set should correspond to variety1. This is the code I have tried in Python using sklearn (version 0.23):

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression

# build pd.DataFrame
varieties = np.concatenate([np.repeat("variety1", 10), 
                            np.repeat("variety2", 30), 
                            np.repeat("variety3", 60)])
columns = {"variety": varieties,
           "length": np.random.randint(30, 70, size=100),
           "width": np.random.randint(40, 50, size=100),
           "weight": np.random.random(100)*100 + 50}

df = pd.DataFrame(columns)

# stratified sampling
kf = StratifiedShuffleSplit(n_splits=5, test_size=0.2)

# fit model based on a cv splitter
lm = LinearRegression()
X = df.loc[:,"length":"width"]
y = df["weight"]
y_pred = cross_val_predict(lm, X, y, cv=kf.split(X, df["variety"]))

However, when I run this code I get the following error:

ValueError: cross_val_predict only works for partitions

This is a bit surprising to me because, according to the sklearn documentation, we can pass a CV splitter to the cv argument of cross_val_predict. I know that I can use a for loop to accomplish what I want:

kf = StratifiedShuffleSplit(n_splits=5, test_size=0.2)
lm = LinearRegression()
X = df.loc[:, "length":"width"]
y = df["weight"]
y_pred = np.zeros(y.size)
for train_idx, test_idx in kf.split(X, df["variety"]):
    #get subsets of variables from CV
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    #fit model
    lm.fit(X_train, y_train)
    pred_vals = lm.predict(X_test)
    
    #store predicted values
    y_pred[test_idx] = pred_vals 

However, I would prefer to use a cross_val_predict to make the code a bit more compact. Is it possible?

1 Answer


Try using StratifiedKFold instead of StratifiedShuffleSplit.
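A minimal sketch of that change, reusing the toy data and variable names from the question (the random_state values here are arbitrary, just to make the run reproducible):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.linear_model import LinearRegression

# same toy data as in the question
varieties = np.concatenate([np.repeat("variety1", 10),
                            np.repeat("variety2", 30),
                            np.repeat("variety3", 60)])
df = pd.DataFrame({"variety": varieties,
                   "length": np.random.randint(30, 70, size=100),
                   "width": np.random.randint(40, 50, size=100),
                   "weight": np.random.random(100) * 100 + 50})

X = df.loc[:, "length":"width"]
y = df["weight"]

# StratifiedKFold yields 5 disjoint test folds that together cover
# every row exactly once, which is what cross_val_predict requires
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(LinearRegression(), X, y,
                           cv=kf.split(X, df["variety"]))

print(y_pred.shape)  # one prediction per observation: (100,)
```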

The difference is that StratifiedKFold shuffles and splits the data only once, so the test folds do not overlap and together cover every observation exactly once. StratifiedShuffleSplit, by contrast, reshuffles before each of its n_splits splits, so the test sets can overlap and some observations may never land in any test set, which means cross_val_predict would have no prediction for them.
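You can see this by collecting the test indices each splitter produces (a small illustration with 100 made-up stratified labels, not the question's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

labels = np.repeat(["a", "b"], 50)   # 100 samples, two strata
X = np.zeros((100, 1))               # features are irrelevant here

# all test indices produced by each splitter, concatenated
kf_test = np.concatenate([test for _, test in
                          StratifiedKFold(n_splits=5).split(X, labels)])
ss_test = np.concatenate([test for _, test in
                          StratifiedShuffleSplit(n_splits=5, test_size=0.2,
                                                 random_state=0).split(X, labels)])

# StratifiedKFold: every index appears exactly once across the 5 test folds
print(len(kf_test), len(np.unique(kf_test)))   # 100 100
# StratifiedShuffleSplit: same total size, but with repeats and gaps
print(len(ss_test), len(np.unique(ss_test)))
```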

You can read more in Catbuilts' explanation

Matin Zivdar