It is not clear to me at what point I should apply scaling to my data, or how I should do it. Also, is the process the same for supervised and unsupervised learning? Is it the same for regression, classification, and neural networks?
First way:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("mydata.csv")
features = df.iloc[:, :-1]
results = df.iloc[:, -1]

# fit the scaler on the whole dataset, then split
scaler = StandardScaler()
features = scaler.fit_transform(features)
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
Second way:
df = pd.read_csv("mydata.csv")
features = df.iloc[:, :-1]
results = df.iloc[:, -1]

scaler = StandardScaler()
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
# split first, then fit the scaler separately on the train and test parts
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)
Third way:
df = pd.read_csv("mydata.csv")
features = df.iloc[:, :-1]
results = df.iloc[:, -1]

scaler = StandardScaler()
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
# split first, fit the scaler on the training set only,
# then apply the already-fitted scaler to the test set
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
Or is there some fourth way?
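I have also seen a variant suggested where the scaler and the model are wrapped in a scikit-learn Pipeline, so the scaling is fitted as part of training. A minimal sketch of what I mean, using the unscaled splits from train_test_split (the LogisticRegression estimator is just a placeholder, not my actual model):

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# the pipeline fits the scaler on whatever data .fit() receives,
# so the scaling statistics come from the training set only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))

Would that be equivalent to one of the three ways above?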
Also, I have some samples that I want to use for prediction, and those samples are not in df. What should I do with that data? Should I do:
samples = scaler.fit_transform(samples)
or:
samples = scaler.transform(samples)
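To make the question concrete, here is a minimal sketch of what I mean, assuming the third way above (scaler fitted on x_train); new_samples.csv and the LogisticRegression model are placeholders, not my real data or model:

from sklearn.linear_model import LogisticRegression

# placeholder model trained on the scaled training data
model = LogisticRegression()
model.fit(x_train, y_train)

# hypothetical unseen samples with the same feature columns as mydata.csv
samples = pd.read_csv("new_samples.csv")

# the two alternatives I am asking about
# (transform comes first: fit_transform below refits the scaler)
scaled_reused = scaler.transform(samples)     # reuse statistics fitted on x_train
scaled_refit = scaler.fit_transform(samples)  # refit the scaler on the new samples

# one of these two would then be passed to model.predict()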