I'm using the well-known jet engine data set to perform RUL (remaining useful life) predictions. I'm comparing different types of regression models, and everything works fine.
I've been comfortably using sklearn's train_test_split with test_size=0.3, which is what I want, BUT I need to know exactly which rows end up in the training split and which in the testing split, because I need them for something else. Does this make sense? Is the split fixed, or are the rows being interchanged and cross-validated without my realizing it?
I think my doubt is about data wrangling only, but I also want to know whether it affects the models somehow.
My dataset shape is (20631, 17)
Some relevant code:
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, r2_score
Making the split
X_train, X_test, y_train, y_test = train_test_split(
    train_df.drop(columns=["RUL", "unit"]),  # columns= already implies axis=1
    train_df["RUL"],
    test_size=0.3,
    random_state=42)
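For context, this is how I was planning to keep track of which rows go where, assuming the split preserves the pandas index (train_rows and test_rows are just my own names):

# train_test_split is given a DataFrame/Series, so the pieces it returns
# should keep the original row labels of train_df
train_rows = X_train.index  # rows of train_df used for training
test_rows = X_test.index    # rows of train_df used for testing

# sanity check: together they should cover the whole data set, with no overlap
assert len(train_rows) + len(test_rows) == len(train_df)
assert set(train_rows).isdisjoint(test_rows)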
Linear regression
from sklearn.linear_model import LinearRegression
LM = LinearRegression()
LM.fit(X_train, y_train)
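I also sanity-check each fitted model with 5-fold cross-validation on the training split only, something like this (cv_mse is just a throwaway name):

cv_mse = -cross_val_score(LM, X_train, y_train, cv=5,
                          scoring="neg_mean_squared_error")
print("LM mean CV MSE:", cv_mse.mean())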
Decision tree regression
from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor(random_state=42)
DT_random_grid = {'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5),
                  'max_features': [None, 'sqrt', 'log2']}  # None = use all features; "auto" is no longer a valid value
DT_gs = RandomizedSearchCV(estimator=DT, param_distributions=DT_random_grid,
                           n_iter=80, cv=5, scoring="neg_mean_squared_error",
                           n_jobs=-1, return_train_score=True)
DT_gs.fit(X_train,y_train)
DT = DT_gs.best_estimator_
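After each search I also peek at what was selected, roughly like this:

print(DT_gs.best_params_)  # hyperparameters of the winning candidate
print(-DT_gs.best_score_)  # mean cross-validated MSE (sign flipped back)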
Random forest regression
from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor(criterion="squared_error", random_state=42, verbose=1)  # "mse" was renamed to "squared_error"
RF_random_grid = {'n_estimators': range(10, 300),
                  'max_features': [None, 'sqrt', 'log2'],
                  'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5)}
RF_gs = RandomizedSearchCV(estimator=RF, param_distributions=RF_random_grid,
                           n_iter=80, cv=5, scoring="neg_mean_squared_error",
                           n_jobs=-1, return_train_score=True, verbose=1)
RF_gs.fit(X_train,y_train)
RF = RF_gs.best_estimator_
Gradient boosted tree regression
from sklearn.ensemble import GradientBoostingRegressor
GB = GradientBoostingRegressor(random_state=42)
GB_random_grid = {'n_estimators': range(10, 300),
                  'learning_rate': [0.01, 0.05, 0.1, 0.2],
                  'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5),
                  'max_depth': range(2, 8)}
GB_gs = RandomizedSearchCV(estimator=GB, param_distributions=GB_random_grid,
                           n_iter=82, cv=5, scoring="neg_mean_squared_error",
                           n_jobs=-1, return_train_score=True, verbose=1)
GB_gs.fit(X_train,y_train)
GB = GB_gs.best_estimator_
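Finally, this is roughly how I compare the four fitted models on the same held-out test split:

for name, model in [("LM", LM), ("DT", DT), ("RF", RF), ("GB", GB)]:
    y_pred = model.predict(X_test)
    print(f"{name}: MAE={mae(y_test, y_pred):.2f}, "
          f"RMSE={mse(y_test, y_pred) ** 0.5:.2f}, "
          f"R2={r2_score(y_test, y_pred):.3f}")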
Thanks!