
It is not clear to me at what point I should apply scaling to my data, or how I should do it. Also, is the process the same for supervised and unsupervised learning, and is it the same for regression, classification, and neural networks?

First way:

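import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("mydata.csv")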
features = df.iloc[:,:-1]
results = df.iloc[:,-1]

scaler = StandardScaler()

features = scaler.fit_transform(features)

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

Second way:

df = pd.read_csv("mydata.csv")
features = df.iloc[:,:-1]
results = df.iloc[:,-1]

scaler = StandardScaler()

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

Third way:

df = pd.read_csv("mydata.csv")
features = df.iloc[:,:-1]
results = df.iloc[:,-1]

scaler = StandardScaler()

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Or maybe some fourth way?

Also, I have some samples that I want to use for prediction. Those samples are not in df. What should I do with them? Should I do:

samples = scaler.fit_transform(samples)

or:

samples = scaler.transform(samples)
taga

1 Answer

  1. Split the data into train/test.
  2. Normalize the training data with the mean and standard deviation of the training set.
  3. Normalize the test data with the mean and standard deviation of the TRAINING set again, not with its own statistics.

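In the real world you cannot know the distribution of the test set in advance, so you need to work with the distribution of your training set. The same applies to new samples you want to predict on: use only transform, with the scaler that was fitted on the training data.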
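The three steps above can be sketched as follows. This is only a minimal illustration: the synthetic `features`, `results`, and `samples` arrays are made-up stand-ins for the question's CSV data, and the `*_scaled` names are chosen here for clarity.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's features/results (hypothetical data).
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
results = rng.integers(0, 2, size=200)

# 1. Split first, so the test rows never influence the scaler.
x_train, x_test, y_train, y_test = train_test_split(
    features, results, test_size=0.3, random_state=0
)

# 2. Fit the scaler on the training rows only, and transform them.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)

# 3. Reuse the training mean and standard deviation for the test rows.
x_test_scaled = scaler.transform(x_test)

# New samples for prediction get the same treatment: transform only.
samples = rng.normal(loc=50.0, scale=10.0, size=(5, 3))
samples_scaled = scaler.transform(samples)
```

This corresponds to the "third way" in the question. Calling `fit_transform` on the test set or on new samples would re-estimate the mean and standard deviation from those rows, so they would no longer be on the same scale the model was trained on.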

Batuhan B
  • So I should use '.fit_transform' on my training features and only '.transform' on my testing features, right? And should I also use only '.transform' on my validation data? – taga Mar 31 '20 at 21:49
  • Yes, on the training set you should use fit_transform, and on your test set you should use only the transform method. On your validation set you also need to use only the transform method. – Batuhan B Mar 31 '20 at 21:51
  • Thanks, can you maybe help me with this question: https://stackoverflow.com/questions/60931790/big-difference-between-val-acc-and-prediction-accuracy-in-keras-neural-network – taga Mar 31 '20 at 21:53
  • You are welcome, let me check that question. – Batuhan B Mar 31 '20 at 21:55
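
For the validation case discussed in the comments, one possible way to keep the "fit on training, transform on validation" rule without doing it by hand is to put the scaler and the model in a scikit-learn pipeline. The sketch below is an illustration under assumptions: the data is synthetic, and `LogisticRegression` is just an example estimator that does not appear in the original question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with very different feature scales (hypothetical).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(120, 3)) * [1.0, 10.0, 100.0]
y_train = (x_train[:, 0] > 0).astype(int)

# The pipeline fits the scaler and the model together.
model = make_pipeline(StandardScaler(), LogisticRegression())

# In each of the 5 folds, the scaler is fitted on that fold's training part
# and only applied (transform) to the held-out validation part.
scores = cross_val_score(model, x_train, y_train, cv=5)
```

After cross-validation, calling `model.fit(x_train, y_train)` and then `model.predict(...)` on test data or new samples applies the training-set scaling automatically.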