
I've been writing some code for a credit card fraud detection problem using scikit-learn. I used train_test_split to split my data into training, test, and validation datasets.

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=123)

I don't understand why random_state here is 123 while splitting the data into training and test sets, and

part_x_train, x_val, part_y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=2)

here random_state is 2 while splitting the data into training and validation sets. Why is there such a difference? I've tried different random_state values but can't see what difference they make.

AcK
    Have you read [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) about this function? The reason to use the parameter is defined there: "Controls the shuffling applied to the data before applying the split. _Pass an int for reproducible output across multiple function calls_." (emphasis added) After reading that, what is your specific question? – G. Anderson Apr 20 '21 at 21:45

1 Answer


The train_test_split function shuffles the rows of the original data, keeps a proportion of them for the training dataset, and leaves the rest for testing.

So if train_size=0.7, the function will shuffle your data and keep 70% of the shuffled rows for training and 30% for testing.
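For example, with a toy 10-row dataset (illustrative data, not your fraud dataset), a train_size of 0.7 leaves 7 rows for training and 3 for testing:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus 10 labels
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
print(len(x_train), len(x_test))  # 7 3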

If you run train_test_split(x, y, train_size=0.7) without setting a random state, the resulting split will (almost) always be different each time you call it.

We set a random state to tell the function to shuffle the data identically every time, so that our results are reproducible.

In other words, if you run train_test_split(x, y, train_size=0.7, random_state=123), you will always get the same result.
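A quick check of that claim (same toy data as above; a sketch, not your actual setup):

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed -> same shuffle -> identical split on every call
a_train, _, _, _ = train_test_split(x, y, train_size=0.7, random_state=123)
b_train, _, _, _ = train_test_split(x, y, train_size=0.7, random_state=123)
print(np.array_equal(a_train, b_train))  # True

# A different seed generally produces a different split
c_train, _, _, _ = train_test_split(x, y, train_size=0.7, random_state=2)
print(np.array_equal(a_train, c_train))  # False (with near certainty)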

As for your code, note that the data being split also changes in the second call. Here is your code with comments:

# Divide `x` and `y` in 70% train and 30% test
#    Note that you are splitting `x` and `y`        ▼  ▼
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    train_size=0.7,
                                                    random_state=123)

# Split the 70% into 80% train and 20% validation
#    Note that you are not splitting `x` and `y` anymore      ▼        ▼
part_x_train, x_val, part_y_train, y_val = train_test_split(x_train, y_train,
                                                            test_size=0.2,
                                                            random_state=2)

Note that in the second split you are splitting x_train and y_train.

This means your code takes 70% of the original data to create a training dataset and then splits that subset into 80% for training and 20% for validation.
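In terms of the original data, that is 70% × 80% = 56% for training, 70% × 20% = 14% for validation, and 30% for testing. A quick sanity check with 100 toy rows (illustrative, not your fraud data):

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(200).reshape(100, 2)  # 100 samples make the fractions easy to read
y = np.arange(100)

# Same two-step split as your code
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=123)
part_x_train, x_val, part_y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=2)

print(len(part_x_train), len(x_val), len(x_test))  # 56 14 30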

Arturo Sbr