
I'm using the well-known jet engine data set to make RUL (remaining useful life) predictions, and I'm comparing different types of regression models; everything works fine.

I've been comfortably using sklearn's train_test_split with test_size=0.3, which is what I want. BUT, I need to know exactly which rows end up in the training split and which in the test split, because I need them for something else. Does this make sense? Are the splits fixed? Or are the rows being interchanged and cross-validated in a way I'm not understanding?

I think my question is only about data wrangling, but I'd also like to know whether this interacts with the models somehow.

My dataset shape is (20631, 17)

Some relevant code:

from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse

Making the split

X_train, X_test, y_train, y_test = train_test_split(train_df.drop(columns=["RUL", "unit"]),
                                                    train_df["RUL"], test_size=0.3, random_state=42)
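For context, a minimal sketch (with a small synthetic DataFrame standing in for `train_df`) of what I'd expect: with a fixed `random_state`, `train_test_split` should be deterministic, and the returned frames should keep the original DataFrame index, so the row labels of each split could be recovered directly.

```python
# Sketch: train_test_split with a fixed random_state is deterministic, and
# the split DataFrames keep the original row index.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["s1", "s2", "RUL"])

X = df.drop(columns=["RUL"])
y = df["RUL"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The split is fixed: repeating the call reproduces the exact same rows.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
assert X_train.index.equals(X_train2.index)

# The original row labels survive in .index, disjoint between the splits,
# and could be saved out (e.g. to CSV) for later use.
assert set(X_train.index).isdisjoint(set(X_test.index))
print(len(X_train.index), len(X_test.index))  # 70 30
```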

Linear regression

from sklearn.linear_model import LinearRegression
LM = LinearRegression()
LM.fit(X_train, y_train)

Decision tree regression

from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor(random_state = 42)
DT_random_grid = {'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5),
                  # "auto" was removed from recent sklearn; None means all features
                  'max_features': [None, "sqrt", "log2"]}
DT_gs = RandomizedSearchCV(estimator=DT, n_jobs=-1, scoring="neg_mean_squared_error",
                           param_distributions=DT_random_grid, n_iter=80, cv=5,
                           return_train_score=True)
DT_gs.fit(X_train,y_train)
DT = DT_gs.best_estimator_

Random forest regression

from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor(criterion="squared_error", random_state=42, verbose=1)
RF_random_grid = {'n_estimators': range(10, 300),
                  'max_features': [None, 'sqrt', 'log2'],
                  'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5)}
RF_gs = RandomizedSearchCV(estimator=RF, n_jobs=-1, scoring="neg_mean_squared_error",
                           param_distributions=RF_random_grid, n_iter=80, cv=5,
                           return_train_score=True, verbose=1)
RF_gs.fit(X_train,y_train)
RF = RF_gs.best_estimator_

Gradient boosted tree regression

from sklearn.ensemble import GradientBoostingRegressor
GB =  GradientBoostingRegressor(random_state = 42)
GB_random_grid = {'n_estimators': range(10, 300),
                  'learning_rate': [0.01, 0.05, 0.1, 0.2],
                  'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5),
                  'max_depth': range(2, 8)}
GB_gs = RandomizedSearchCV(estimator=GB, n_jobs=-1, scoring="neg_mean_squared_error",
                           param_distributions=GB_random_grid, n_iter=82, cv=5,
                           return_train_score=True, verbose=1)
GB_gs.fit(X_train,y_train)
GB = GB_gs.best_estimator_
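For the cross-validation part of my doubt, here is a sketch of what I understand `cv=5` inside `RandomizedSearchCV` to do (using a plain `KFold`, which I believe is the default CV splitter for regressors): the 5 folds are drawn from the rows of `X_train` only, so the held-out `X_test` from `train_test_split` would never enter the search. The `70` below is a hypothetical training-set size standing in for my real one.

```python
# Sketch: cv=5 re-splits only the training rows. KFold yields positional
# indices into whatever array it is given, and the validation folds
# together cover the training set exactly once.
import numpy as np
from sklearn.model_selection import KFold

n_train = 70  # stand-in for len(X_train)
cv = KFold(n_splits=5)
val_idx = []
for fold_train_idx, fold_val_idx in cv.split(np.zeros((n_train, 1))):
    val_idx.extend(fold_val_idx)

# Every fold index points into the training rows; the folds partition them.
assert sorted(val_idx) == list(range(n_train))
```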

Thanks!

    I believe this is answered [here](https://stackoverflow.com/questions/31521170/scikit-learn-train-test-split-with-indices) – Matthew Barlowe Apr 02 '20 at 05:57
  • Hmm... I think I just needed to print X_train and X_test and that's all. Then save that into a CSV for what I want – xcentralx Apr 02 '20 at 06:23
