
I am working on a decision tree and setting the random state. However, the output is not reproducible.

I attached my code below:

import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv('inputfile.csv')

# Create training, validation, and testing data sets
train, test = train_test_split(df, test_size = 0.3, random_state = 4044)
val, test = train_test_split(test, test_size = 2/3, random_state = 4044)

train.reset_index(drop = True, inplace = True)
val.reset_index(drop = True, inplace = True)
test.reset_index(drop = True, inplace = True)


def regressionTree(train_df, val_df, depthParams, maxFeatParams):        
    
    # Create an empty dictionary to store the results
    modelDict = {}

    # Grid search over max_depth and max_features parameters and return the performance on the validation set
    for depth in depthParams:
        for max_feature in maxFeatParams:
            
            aTree = tree.DecisionTreeRegressor(max_depth = depth, max_features = max_feature, random_state = 42).fit(train_df.drop(['y'], axis = 1), train_df['y'])

            # Score the model on the validation data set
            y_pred = aTree.predict(val_df.drop(['y'], axis = 1))

            # Store results in a dictionary
            modelDict.update({str(depth) + ' ' + str(max_feature): {
                'model': aTree,
                'rmse': mean_squared_error(val_df['y'], y_pred, squared = False),
                'rsquared': aTree.score(val_df.drop(['y'], axis = 1), val_df['y'])
            }})

    return modelDict

step2out = regressionTree(train_df = train, val_df = val, depthParams = [3], maxFeatParams = [0.5, 0.75])

Here is my output when I run it the first time:

{'3 0.5': {'model': DecisionTreeRegressor(max_depth=3, max_features=0.5, random_state=42),
      'rmse': 0.22108214969064957,
      'rsquared': 0.13924080856472543},
     '3 0.75': {'model': DecisionTreeRegressor(max_depth=3, max_features=0.75, random_state=42),
      'rmse': 0.221547801229057,
      'rsquared': 0.13561106327008754}}

Here's the output after I clear my kernel and re-run the script:

{'3 0.5': {'model': DecisionTreeRegressor(max_depth=3, max_features=0.5, random_state=42),
  'rmse': 0.22195369915849586,
  'rsquared': 0.13244086634306618},
 '3 0.75': {'model': DecisionTreeRegressor(max_depth=3, max_features=0.75, random_state=42),
  'rmse': 0.2215647793308301,
  'rsquared': 0.13547857497107196}}

Despite having the same random_state, the outputs are different. The nodes in the trees change, and the variables selected can differ drastically.

  • Does this answer your question? [confused about random\_state in decision tree of scikit learn](https://stackoverflow.com/questions/39158003/confused-about-random-state-in-decision-tree-of-scikit-learn) – Nicholas Hansen-Feruch Jul 27 '22 at 00:07
  • @NicholasHansen-Feruch Not quite. I understand how random_state works, but it isn't reproducible even after I set random_state to a fixed integer. Look at my output to see that the RMSE and R-squared change. – mt1 Jul 27 '22 at 00:16

1 Answer


I figured out that the order of the columns affects the fitted tree: with max_features < 1.0, the tree samples candidate features by index using random_state, so the same sampled indices point to different features when the columns arrive in a different order. Sorting the columns before fitting solved the issue:

train = train.reindex(sorted(train.columns), axis = 1)
val = val.reindex(sorted(val.columns), axis = 1)

Here is my edited full code:

import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv('inputfile.csv')

# Create training, validation, and testing data sets
train, test = train_test_split(df, test_size = 0.3, random_state = 4044)
val, test = train_test_split(test, test_size = 2/3, random_state = 4044)

train.reset_index(drop = True, inplace = True)
val.reset_index(drop = True, inplace = True)
test.reset_index(drop = True, inplace = True)


# Order the columns, which is necessary to replicate the results with a specific random_state
train = train.reindex(sorted(train.columns), axis = 1)
val = val.reindex(sorted(val.columns), axis = 1)

def regressionTree(train_df, val_df, depthParams, maxFeatParams):        

    # Create an empty dictionary to store the results
    modelDict = {}

    # Grid search over max_depth and max_features parameters and return the performance on the validation set
    for depth in depthParams:
        for max_feature in maxFeatParams:

            aTree = tree.DecisionTreeRegressor(max_depth = depth, max_features = max_feature, random_state = 42).fit(train_df.drop(['y'], axis = 1), train_df['y'])

            # Score the model on the validation data set
            y_pred = aTree.predict(val_df.drop(['y'], axis = 1))

            # Store results in a dictionary
            modelDict.update({str(depth) + ' ' + str(max_feature): {
                'model': aTree,
                'rmse': mean_squared_error(val_df['y'], y_pred, squared = False),
                'rsquared': aTree.score(val_df.drop(['y'], axis = 1), val_df['y'])
            }})

    return modelDict

step2out = regressionTree(train_df = train, val_df = val, depthParams = [2, 3], maxFeatParams = [0.5, 0.666, 0.75])
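Not part of the original post, but once the grid search returns, the winning configuration can be pulled out of the dictionary by lowest validation RMSE. A sketch against a hypothetical results dict with the same shape as modelDict (the numbers are made up for illustration):

```python
# Hypothetical results with the same structure as modelDict;
# values are illustrative, not from the original run
results = {
    '3 0.5':  {'rmse': 0.2211, 'rsquared': 0.1392},
    '3 0.75': {'rmse': 0.2215, 'rsquared': 0.1356},
}

# Pick the configuration with the lowest validation RMSE
best_key = min(results, key=lambda k: results[k]['rmse'])
best = results[best_key]
print(best_key, best['rmse'])
```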