
I am studying random forests with some data I collected. I tested my classifier and was getting an accuracy of about 89% on my test set. However, when I scaled my data to zero mean and unit variance, my accuracy dropped by almost 50%. I came across this post, which seems to suggest I don't need to scale the data to get optimal performance.

Could anybody shed some light on what could be the possible reasons for such a significant drop in accuracy?

Edit: I am using sklearn.ensemble for my random forest implementation.

Here's a link to data
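
For reference, here is a minimal sketch of what such a scaling + evaluation setup might look like with sklearn.ensemble; the variable names `X_train`, `X_test`, `y_train`, `y_test` are placeholders, not the original code or data:

```python
# Minimal sketch (placeholder data): standardize features, then fit a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit scaling statistics on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test split

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train_scaled, y_train)
print("accuracy (scaled):", accuracy_score(y_test, clf.predict(X_test_scaled)))
```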

Ajit

1 Answer


Whether your random forest is invariant with respect to some transformation of the input features depends solely on your error functional. In short, if the functional is invariant under shifting and scaling, then your model is as well.

After briefly browsing the help page here, it seems that the standard functional used is deviance loss. This functional is not invariant under scaling of the input features, which would explain your observation.
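
One way to check whether the scaling alone accounts for the drop is to fit the same forest (identical seed and split) on raw and standardized copies of the data and compare accuracies. A rough sketch, assuming a feature matrix `X` and label vector `y` as placeholders:

```python
# Rough sketch: same forest, same split, raw vs. standardized features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)  # statistics from the training split only
variants = {
    "raw": (X_train, X_test),
    "scaled": (scaler.transform(X_train), scaler.transform(X_test)),
}

for label, (Xtr, Xte) in variants.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(Xtr, y_train)
    print(label, accuracy_score(y_test, clf.predict(Xte)))
```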

davidhigh
  • I have a feeling it has something to do with the data. I also tried a support vector machine, which, oddly enough, gave better classification with unscaled data. – Ajit Oct 20 '14 at 09:44