I'm using the well-known jet engine data set to perform RUL (remaining useful life) predictions. I'm comparing different types of regression models, and everything works fine.
I've been comfortably using sklearn's train_test_split with test_size=0.3, which is what I want, BUT I need to know exactly which rows end up in the training split and which in the testing split, because I need them for something else. Does this make sense? Is the split fixed, or are the rows being interchanged and cross-validated without my realizing it?
I think my doubt is about data wrangling only, but I also want to know whether it affects the models somehow.
My dataset shape is (20631, 17)
Some relevant code:
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error as mae, mean_squared_error as mse, r2_score
Making the split
X_train, X_test, y_train, y_test = train_test_split(
    train_df.drop(columns=["RUL", "unit"]),  # columns= already implies axis=1
    train_df["RUL"],
    test_size=0.3,
    random_state=42)
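For context, this is how I was planning to keep track of which rows go where, assuming the split preserves the pandas index (train_rows and test_rows are just my own names):

# train_test_split is given a DataFrame/Series, so the pieces it returns
# should keep the original row labels of train_df
train_rows = X_train.index  # rows of train_df used for training
test_rows = X_test.index    # rows of train_df used for testing

# sanity check: together they should cover the whole data set, with no overlap
assert len(train_rows) + len(test_rows) == len(train_df)
assert set(train_rows).isdisjoint(test_rows)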
Linear regression
from sklearn.linear_model import LinearRegression
LM = LinearRegression()
LM.fit(X_train, y_train)
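I also sanity-check each fitted model with 5-fold cross-validation on the training split only, something like this (cv_mse is just a throwaway name):

cv_mse = -cross_val_score(LM, X_train, y_train, cv=5,
                          scoring="neg_mean_squared_error")
print("LM mean CV MSE:", cv_mse.mean())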
Decision tree regression
from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor(random_state=42)
DT_random_grid = {'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5),
                  'max_features': [None, 'sqrt', 'log2']}  # None = use all features; "auto" is no longer a valid value
DT_gs = RandomizedSearchCV(estimator=DT, param_distributions=DT_random_grid,
                           n_iter=80, cv=5, scoring="neg_mean_squared_error",
                           n_jobs=-1, return_train_score=True)
DT_gs.fit(X_train,y_train)
DT = DT_gs.best_estimator_
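After each search I also peek at what was selected, roughly like this:

print(DT_gs.best_params_)  # hyperparameters of the winning candidate
print(-DT_gs.best_score_)  # mean cross-validated MSE (sign flipped back)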
Random forest regression
from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor(criterion="squared_error", random_state=42, verbose=1)  # "mse" was renamed to "squared_error"
RF_random_grid = {'n_estimators': range(10, 300),
                  'max_features': [None, 'sqrt', 'log2'],
                  'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5)}
RF_gs = RandomizedSearchCV(estimator=RF, param_distributions=RF_random_grid,
                           n_iter=80, cv=5, scoring="neg_mean_squared_error",
                           n_jobs=-1, return_train_score=True, verbose=1)
RF_gs.fit(X_train,y_train)
RF = RF_gs.best_estimator_
Gradient boosted tree regression
from sklearn.ensemble import GradientBoostingRegressor
GB = GradientBoostingRegressor(random_state=42)
GB_random_grid = {'n_estimators': range(10, 300),
                  'learning_rate': [0.01, 0.05, 0.1, 0.2],
                  'min_samples_split': range(2, 10),
                  'min_samples_leaf': range(1, 5),
                  'max_depth': range(2, 8)}
GB_gs = RandomizedSearchCV(estimator=GB, param_distributions=GB_random_grid,
                           n_iter=82, cv=5, scoring="neg_mean_squared_error",
                           n_jobs=-1, return_train_score=True, verbose=1)
GB_gs.fit(X_train,y_train)
GB = GB_gs.best_estimator_
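Finally, this is roughly how I compare the four fitted models on the same held-out test split:

for name, model in [("LM", LM), ("DT", DT), ("RF", RF), ("GB", GB)]:
    y_pred = model.predict(X_test)
    print(f"{name}: MAE={mae(y_test, y_pred):.2f}, "
          f"RMSE={mse(y_test, y_pred) ** 0.5:.2f}, "
          f"R2={r2_score(y_test, y_pred):.3f}")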
Thanks!