1

I built a linear model with sklearn on the Cement and Concrete Composites dataset.

Initially, I used train_test_split(X, Y, test_size=0.3, shuffle=False) and computed the train and test errors.

Now I want to run the same model 10 times with shuffle=True and compute the mean and standard deviation of the errors, so that the new results can be compared with the first ones.

How can I run the same model n times in a loop and save the errors in a list?

Gvasiles

2 Answers

0

Try something like this:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

errors = []
for i in range(10):
    # each call draws a new random 70/30 split
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, shuffle=True)
    model = LinearRegression()  # the model you want to use here
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # the error metric you want to use here (a regression metric such as MSE)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
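
To get the comparison the question asks for (one unshuffled baseline versus the mean and sd over 10 shuffled runs, for both train and test error), the loop could be extended roughly as follows. This is only a sketch: the data comes from make_regression as a stand-in for the concrete dataset, split_errors is a hypothetical helper name, and MSE is assumed as the error metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete dataset (assumption, not the real data).
X, Y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)


def split_errors(X, Y, **split_kwargs):
    """Fit on one 70/30 split and return (train MSE, test MSE)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, **split_kwargs
    )
    model = LinearRegression().fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse


# Baseline: the single unshuffled split from the question.
baseline_train, baseline_test = split_errors(X, Y, shuffle=False)

# Ten shuffled splits; one seed per run keeps the experiment reproducible.
runs = np.array(
    [split_errors(X, Y, shuffle=True, random_state=seed) for seed in range(10)]
)
train_errors, test_errors = runs[:, 0], runs[:, 1]

print(f"baseline  train={baseline_train:.2f}  test={baseline_test:.2f}")
print(f"shuffled  train={train_errors.mean():.2f} +/- {train_errors.std():.2f}")
print(f"shuffled  test={test_errors.mean():.2f} +/- {test_errors.std():.2f}")
```

Passing a different random_state per run is a design choice, not a requirement; leaving it out also gives 10 different splits, just not repeatable ones.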
bertslike
0

What you need is cross-validation: repeated evaluation of the model on different splits of the same data. With shuffling enabled, train_test_split is essentially a wrapper around a single ShuffleSplit iteration.
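
The relationship between the two can be checked on toy data. The sketch below assumes that, for the same test_size and integer random_state, train_test_split and the first ShuffleSplit iteration pick the same rows; this reflects how scikit-learn is implemented at the time of writing rather than a documented guarantee, and the arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Toy data (assumption): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# One ShuffleSplit iteration with the same test_size and seed.
cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(cv.split(X))

# Compare the rows chosen by both approaches.
same_split = np.array_equal(X_train, X[train_idx]) and np.array_equal(
    X_test, X[test_idx]
)
print(same_split)
```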

In your case it might look like this:

from sklearn.model_selection import ShuffleSplit, cross_val_score
import numpy as np
from sklearn.linear_model import LinearRegression

X, y = ... # read dataset

model = LinearRegression()

# n_splits=10 is for 10 random shuffled train-test splits
cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)

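# this scorer returns the negated MSE (higher is better), so flip the sign to get errors
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
errors = -scores
np.mean(errors), np.std(errors)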
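
Since the question asks for both train and test error, cross_validate with return_train_score=True may be a closer fit than cross_val_score, which only reports test scores. A minimal sketch, assuming synthetic data from make_regression in place of the real dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in data (assumption); replace with the real X, y.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
results = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

# Scores are negated MSE, so flip the sign to get errors.
train_mse = -results["train_score"]
test_mse = -results["test_score"]

print(f"train MSE: {train_mse.mean():.2f} +/- {train_mse.std():.2f}")
print(f"test MSE:  {test_mse.mean():.2f} +/- {test_mse.std():.2f}")
```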

If you want to compute the error on your own or do anything else with models/results, you could do it like this:

for train_ids, test_ids in cv.split(X):
    # indexing like this assumes X and y are NumPy arrays
    model.fit(X[train_ids], y[train_ids])
    r2 = model.score(X[test_ids], y[test_ids])  # R^2 on the held-out split
    ...
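
As one possible way to fill in that loop, the sketch below collects a custom metric per split in a list. Mean absolute error is an arbitrary choice for illustration, and the data is again a synthetic stand-in rather than the concrete dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in data (assumption); replace with the real X, y.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
model = LinearRegression()

test_errors = []
for train_ids, test_ids in cv.split(X):
    # NumPy arrays accept index arrays directly; with a pandas DataFrame
    # use X.iloc[train_ids] and X.iloc[test_ids] instead.
    model.fit(X[train_ids], y[train_ids])
    y_pred = model.predict(X[test_ids])
    test_errors.append(mean_absolute_error(y[test_ids], y_pred))

print(np.mean(test_errors), np.std(test_errors))
```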

More about this: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html

Bohdan I