Accuracy of preprocessing single sample

Question

I've been working to predict samples with the sklearn implementation of KNN.

So far I've been training my classifier on one sample of my dataset and testing it on another, distinct sample of the same dataset, and I'm seeing an accuracy of around 98%.

However, when I try to predict a single sample, the predictions are all over the place, even for samples the model has been trained on. My only guess is that preprocessing the entire dataset with preprocessing.scale behaves differently from preprocessing a single sample with the same technique.

I've read Preprocessing in scikit learn - single sample - Depreciation warning and am wondering if there is a correct way to preprocess a single sample.

EDIT: The preprocessing code is shown below. For the whole dataset:

self.trainData = preprocessing.scale(self.trainData)

For a single sample, where log has the same form as the samples in trainData:

log = preprocessing.scale(log)

Show the code: how are you processing the single sample? You should use the same scales that were used in training. If you call `scale()` on an individual sample, it will give wrong results. — Vivek Kumar, Apr 06 '18 at 05:15
I've added the code. If I get incorrect results when using scale() on a single sample, how would I preprocess a single sample correctly? — styresd, Apr 06 '18 at 05:29

score 1 · Accepted Answer · answered Apr 06 '18 at 05:55

You should use StandardScaler, which is a wrapper over the scale function, as described here. This wrapper stores the mean and standard deviation learned from the training data and then uses them to scale any other data.

Example usage:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data, then scale it
trainData = scaler.fit_transform(trainData)
# reshape is needed only because log is a single 1-D sample; transform() expects a 2-D array
log = scaler.transform(np.reshape(log, (1, -1)))

fit_transform() is just a shortcut for first calling fit() and then transform().
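
As a quick check of that equivalence, here is a minimal sketch. The toy array X_train is hypothetical data made up for illustration; it is not the asker's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data: 4 samples, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# One step: learn the statistics and scale in a single call
one_step = StandardScaler().fit_transform(X_train)

# Two steps: learn the statistics first, then scale
scaler = StandardScaler()
scaler.fit(X_train)
two_step = scaler.transform(X_train)

# Both routes should produce the same scaled array
same = np.allclose(one_step, two_step)
print(same)
```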

fit() does not return transformed data (it returns the fitted scaler itself). It just analyses the data to learn the mean and standard deviation. transform() then uses the learnt mean and standard deviation to scale the data and returns the scaled data.
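
To make the difference concrete, the sketch below compares transform() on a fitted scaler with calling scale() directly on one sample. The toy array is hypothetical, and the single sample is passed as a 2-D array with one row, which may differ from the shape used in the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Hypothetical toy training data: 4 samples, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-feature mean learned from the training data
print(scaler.scale_)  # per-feature standard deviation learned from the training data

# A single sample, taken from the training data, as a 2-D array with one row
sample = X_train[0].reshape(1, -1)

# Scaled with the statistics learned from the whole training set
correct = scaler.transform(sample)

# scale() recomputes the statistics from the sample alone; with one row,
# each feature's mean is the value itself, so every feature collapses to 0
wrong = scale(sample)

print(correct)
print(wrong)
```

Because scale() keeps no state, the values it produces for a lone sample carry no information about where that sample sits relative to the training data, which would explain the erratic KNN predictions.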

You should only call fit() or fit_transform() on the training data, never on anything else. To transform test data or new data, always use transform().
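
One way to make this rule hard to break is to put the scaler and the classifier in a single pipeline, so the scaler is fitted only when the pipeline is fitted. This is a sketch under assumptions, not the asker's setup: the iris dataset, the split parameters and n_neighbors=5 are all chosen here purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset and split (assumptions, not from the question)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() on the pipeline fits the scaler on the training data only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# score() and predict() only call transform() on the scaler internally
accuracy = model.score(X_test, y_test)

# A single sample still needs to be a 2-D array with one row
single_prediction = model.predict(X_test[0].reshape(1, -1))
print(accuracy, single_prediction)
```

With this layout there is no separate scaling step to forget or to apply incorrectly at prediction time.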

Awesome, that solves the issue, and ended up improving the accuracy of the model a bit as well. — styresd, Apr 06 '18 at 17:48
