Scaling before or after splitting the data in Python Keras

Question

It is not clear to me at what point I should apply scaling to my data, and how I should do it. Also, is the process the same for supervised and unsupervised learning, and is it the same for regression, classification and neural networks?

First way:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("mydata.csv")
features = df.iloc[:,:-1]
results = df.iloc[:,-1]

scaler = StandardScaler()

features = scaler.fit_transform(features)

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

Second way:

df = pd.read_csv("mydata.csv")
features = df.iloc[:,:-1]
results = df.iloc[:,-1]

scaler = StandardScaler()

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Third way:

df = pd.read_csv("mydata.csv")
features = df.iloc[:,:-1]
results = df.iloc[:,-1]

scaler = StandardScaler()

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Or maybe something fourth?

Also, I have some samples that I want to use for prediction. Those samples are not in df. What should I do with that data? Should I do:

samples = scaler.fit_transform(samples)

or:

samples = scaler.transform(samples)

score 4 · Accepted Answer · answered Mar 31 '20 at 21:48

Split the data into train/test.
Normalize the training data with the mean and standard deviation of the training data set.
Normalize the test data, AGAIN with the mean and standard deviation of the TRAINING DATA set.
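The three steps above can be sketched as follows. This is only a minimal illustration: the synthetic array stands in for the question's df (the shape, the value range and the variable names with the _scaled suffix are assumptions made for the example), and it corresponds to the "third way" in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data (assumption: 100 rows, 3 features).
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
results = rng.integers(0, 2, size=100)

# Step 1: split first, so the test rows never influence the scaler.
x_train, x_test, y_train, y_test = train_test_split(
    features, results, test_size=0.3, random_state=0
)

# Step 2: learn mean and standard deviation from the training data only,
# and scale the training data with them.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)

# Step 3: scale the test data with the SAME training statistics.
x_test_scaled = scaler.transform(x_test)
```

After this, the scaled training columns have mean 0 and unit variance, while the scaled test columns are generally only close to that, which is expected.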

In the real world you cannot know the distribution of the test set, so you need to work with the distribution of your training set.
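The same reasoning seems to carry over to the extra samples mentioned in the question: they are unseen data, so they would be scaled with the scaler that was already fitted on the training set. Below is a small sketch with made-up numbers (the arrays and the names samples_ok and samples_refit are hypothetical) that shows what goes wrong when the scaler is refitted on the new samples.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: a single feature centred at 50.
x_train = np.array([[40.0], [50.0], [60.0]])
scaler = StandardScaler().fit(x_train)

# Hypothetical new samples that arrive at prediction time.
samples = np.array([[70.0], [80.0]])

# Reusing the training statistics keeps the samples' position relative
# to the training distribution: both end up well above zero.
samples_ok = scaler.transform(samples)

# Refitting on the samples re-centres them on their own mean, so the
# model would receive inputs on a different scale than it was trained on.
samples_refit = StandardScaler().fit_transform(samples)

print(samples_ok.ravel())
print(samples_refit.ravel())
```

Here the refitted values come out as -1 and 1, as if the samples were typical training points, even though both lie above every training value.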

Batuhan B

So I should '.fit_transform' my training features and only '.transform' my testing features, right? Also, should I perform only '.transform' on my validation data? – taga Mar 31 '20 at 21:49
Yes, on the training set you should use fit_transform, and on the test set you should use only the transform method. On the validation set you also need to use only the transform method. – Batuhan B Mar 31 '20 at 21:51
Thanks, can you maybe help me with this question: https://stackoverflow.com/questions/60931790/big-difference-between-val-acc-and-prediction-accuracy-in-keras-neural-network – taga Mar 31 '20 at 21:53
You are welcome, let me check that question. – Batuhan B Mar 31 '20 at 21:55
