In my dataset I have a binary Target (0 or 1) variable, and 8 features: nchar, rtc, Tmean, week_day, hour, ntags, nlinks and nex. week_day is a factor while the others are numeric. I built a decision tree classifier, but my question concerns the feature scaling:

library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-c(2,4)] = scale(training_set[-c(2,4)])
test_set[-c(2,4)] = scale(test_set[-c(2,4)])

The model reports Tmean = -0.057 and ntags = 2 as two of its split points. How can I recover the original values of these two features, that is, the values they had before the rescaling performed by scale()?

Mark

1 Answer

If the data were scaled with scale(), the following function, unscale, might help. It relies on the "scaled:center" and "scaled:scale" attributes that scale() attaches to its result.
The original matrix and the unscaled one are all.equal but not identical, due to floating-point precision.

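unscale <- function(x){
  # Parameters stored by scale(); either may be missing
  xbar <- attr(x, "scaled:center")
  se <- attr(x, "scaled:scale")
  # Nothing to undo if scale() left no attributes
  if(is.null(xbar) && is.null(se)) return(x)
  # Transpose so the per-column values recycle along the right dimension
  y <- t(x)
  if(!is.null(se)) y <- se * y
  if(!is.null(xbar)) y <- y + xbar
  y <- t(y)
  attr(y, "scaled:center") <- NULL
  attr(y, "scaled:scale") <- NULL
  y
}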

set.seed(2020)
A <- matrix(rnorm(120, sd = 16), ncol = 5)
s <- scale(A)
identical(A, unscale(s))  #FALSE

zeros <- as.vector(A - unscale(s))
all.equal(zeros, rep(0, 120))  
#[1] TRUE

The function also works with data.frames, but the class of its output is "matrix", not the original "data.frame". This is because scale() itself returns a matrix.

B <- as.data.frame(matrix(A, ncol = 5))
s2 <- scale(B)
B2 <- as.data.frame(unscale(s2))
all.equal(B, B2)
#[1] TRUE

But the right way of scaling/unscaling a data.frame is column by column, so that the result stays a data.frame. This can be done with an lapply loop, for instance.

s3 <- B
s3[] <- lapply(B, scale)

B3 <- s3
B3[] <- lapply(s3, unscale)
all(abs(B - B3) < .Machine$double.eps^0.5)
#[1] TRUE
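
To map a single split point back to the original units, only the center and scale of that one column are needed, since scale() computes z = (x - center) / scale. The following is a minimal sketch, assuming the scaled matrix is still available with its attributes. The column names and the threshold -0.057 are borrowed from the question; the data are simulated and the object names (X, ctr, scl, z_split, x_split) are only illustrative.

```r
set.seed(2020)
# Simulated stand-in for the real data; only the column names come from the question
X <- cbind(Tmean = rnorm(50, mean = 15, sd = 5),
           ntags = rpois(50, lambda = 3))
s <- scale(X)

# Per-column parameters stored by scale()
ctr <- attr(s, "scaled:center")
scl <- attr(s, "scaled:scale")

# Invert z = (x - center) / scale for one threshold
z_split <- -0.057
x_split <- z_split * scl[["Tmean"]] + ctr[["Tmean"]]
x_split
```
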
Rui Barradas
  • Thank you! But it seems that the function hasn't affected my dataframe. The values have not changed. – Mark Mar 25 '21 at 10:44
  • @Mark Have you assigned the result back to the df? – Rui Barradas Mar 25 '21 at 10:46
  • Yes, `training_set` and `test_set` are dataframe. However, the problem could be that `attr(training_set, "scaled:center")` returns `NULL`. I can't explain why. – Mark Mar 25 '21 at 10:55
  • @Mark In the code you've posted, `training_set` is not scaled. So, there is no reason for the `"scaled:*"` attributes to be set. – Rui Barradas Mar 25 '21 at 10:58
  • I'm so sorry, I forgot to include the feature scaling part. – Mark Mar 25 '21 at 11:35
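
A note on the NULL seen in the comments: the "scaled:*" attributes belong to the matrix returned by scale(), not to the data.frame it is assigned into, so attr(training_set, "scaled:center") is NULL after the assignment in the question. One possible workaround, sketched below under the assumption that the scaling step can be rerun, is to keep the scale() result in its own variable first. The data are simulated, and num_cols, sc, ctr, scl, splits and orig_splits are hypothetical names. The sketch also reuses the training parameters for the test set, which differs from the code in the question.

```r
set.seed(123)
# Hypothetical stand-in for the asker's data
training_set <- data.frame(Tmean = rnorm(30, 15, 5), ntags = rpois(30, 3))
test_set     <- data.frame(Tmean = rnorm(10, 15, 5), ntags = rpois(10, 3))

num_cols <- c("Tmean", "ntags")

# Keep the scale() result so its attributes are not lost
sc  <- scale(training_set[num_cols])
ctr <- attr(sc, "scaled:center")
scl <- attr(sc, "scaled:scale")

training_set[num_cols] <- sc
# Scale the test set with the training parameters, not its own
test_set[num_cols] <- scale(test_set[num_cols], center = ctr, scale = scl)

# Back-transform the split points reported by the tree
splits <- c(Tmean = -0.057, ntags = 2)
orig_splits <- splits * scl[names(splits)] + ctr[names(splits)]
orig_splits
```
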