I am trying to use supervised machine learning to predict the weight of a crop (e.g. potatoes) from its length and width measurements. Before fitting a specific model (e.g. linear regression), I want to stratify my cross-validation splits by the frequency of each crop variety in my data set. For example, if I split my data into 5 partitions (i.e. I use cross-validation) and variety1 accounts for 50% of my observations, then 50% of the observations in each training set should belong to variety1. This is the code I have tried in Python using sklearn (version 0.23):
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
# build pd.DataFrame
varieties = np.concatenate([np.repeat("variety1", 10),
                            np.repeat("variety2", 30),
                            np.repeat("variety3", 60)])
columns = {"variety": varieties,
           "length": np.random.randint(30, 70, size=100),
           "width": np.random.randint(40, 50, size=100),
           "weight": np.random.random(100)*100 + 50}
df = pd.DataFrame(columns)
# stratified sampling
kf = StratifiedShuffleSplit(n_splits=5, test_size=0.2)
# fit model based on a cv splitter
lm = LinearRegression()
X = df.loc[:,"length":"width"]
y = df["weight"]
y_pred = cross_val_predict(lm, X, y, cv=kf.split(X, df["variety"]))
However, when I run this code I get the following error:
ValueError: cross_val_predict only works for partitions
This is a bit surprising to me because, according to the sklearn documentation, a splitter can be passed as the cv argument of cross_val_predict. I know that I can use a for loop to accomplish what I want:
kf = StratifiedShuffleSplit(n_splits=5, test_size=0.2)
X = df.loc[:,"length":"width"]
y = df["weight"]
y_pred = np.zeros(y.size)
for train_idx, test_idx in kf.split(X, df["variety"]):
    # get subsets of variables from CV
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # fit model
    lm.fit(X_train, y_train)
    pred_vals = lm.predict(X_test)
    # store predicted values
    y_pred[test_idx] = pred_vals
However, I would prefer to use cross_val_predict to make the code more compact. Is that possible?
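For reference, here is a minimal sketch of one direction that might work. It assumes the error arises because the test sets drawn by StratifiedShuffleSplit can overlap and need not cover every observation, whereas cross_val_predict needs each observation to land in exactly one test set. Swapping in StratifiedKFold (my substitution, not part of the attempt above) should give stratified test folds that do form a partition. The fixed seeds and the names rng, skf and splits are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Rebuild the toy data set; the seed is an assumption added for reproducibility.
rng = np.random.RandomState(0)
varieties = np.repeat(["variety1", "variety2", "variety3"], [10, 30, 60])
df = pd.DataFrame({
    "variety": varieties,
    "length": rng.randint(30, 70, size=100),
    "width": rng.randint(40, 50, size=100),
    "weight": rng.random_sample(100) * 100 + 50,
})
X = df.loc[:, "length":"width"]
y = df["weight"]

# StratifiedKFold stratifies on the labels passed to split() (here the variety),
# and its test folds are disjoint and together cover every row.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(X, df["variety"]))

# cv accepts an iterable of (train_idx, test_idx) pairs, so the regression
# target y can differ from the column used for stratification.
y_pred = cross_val_predict(LinearRegression(), X, y, cv=splits)
```

With the 10/30/60 variety counts above, each test fold should hold 2, 6 and 12 rows of the three varieties, so each training set keeps the same 10%/30%/60% proportions. Unlike StratifiedShuffleSplit, test_size cannot be chosen freely here; it is fixed at 1/n_splits.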