
I have been exploring the new recipes package for variable transformations as part of a machine learning pipeline. I opted to upgrade from caret's preProcess function to recipes because of all the new extensions. But I am finding that the two packages give very different results for the transformed data:

library(caret)   # v6.0-79
library(recipes) # v0.1.2
library(MASS)    # v7.3-47

# estimate and apply Box-Cox transformations using recipes
rec_box <- recipe(~ ., data = as.data.frame(state.x77)) %>% 
  step_BoxCox(everything()) %>% 
  prep(training = as.data.frame(state.x77)) %>% 
  bake(newdata = as.data.frame(state.x77))

> head(rec_box)
# A tibble: 6 x 8
  Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area
       <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl>
1       8.19   138.     0.647   60171653.   6.89      651.   20.  56.0
2       5.90   185.     0.376   61218586.   5.52     1632.  152. 106. 
3       7.70   155.     0.527   66409311.   4.08     1253.   15.  69.4
4       7.65   133.     0.570   66885876.   5.05      609.   65.  56.4
5       9.96   165.     0.0936  71570875.   5.13     1445.   20.  75.5
6       7.84   161.    -0.382   73188251.   3.62     1503.  166.  67.7

# transform variables using preProcess
pre_box <- preProcess(x = as.data.frame(state.x77), method = 'BoxCox') %>% 
  predict(newdata = as.data.frame(state.x77))

> head(pre_box)
# A tibble: 6 x 8
  Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area
       <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl>
1       8.19   118.     0.642       2383.   6.83      618.   20.  38.7
2       5.90   157.     0.374       2401.   5.47     1538.  152.  65.7
3       7.70   133.     0.524       2488.   4.05     1183.   15.  46.3
4       7.65   114.     0.566       2496.   5.01      579.   65.  38.9
5       9.96   141.     0.0935      2571.   5.09     1363.   20.  49.7
6       7.84   138.    -0.383       2596.   3.60     1418.  166.  45.4


## Subtract recipe transformations from MASS::boxcox via caret::preProcess
> colMeans(rec_box - pre_box)
  Population       Income   Illiteracy     Life Exp       Murder      HS Grad        Frost         Area 
0.000000e+00 2.215800e+01 2.515464e-03 6.803437e+07 2.638715e-02 5.883549e+01 0.000000e+00 1.745788e+01

So it would seem that they agree on some columns but are way off on others. Is there a reason why these transformations might be so different? Has anyone else found similar discrepancies?


1 Answer


The difference is due to the rounding of lambdas in the preProcess function, which rounds the estimated lambdas to one decimal place.
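
To see why one decimal of lambda matters, here is a minimal sketch of the Box-Cox formula applied with an exact versus a rounded lambda (the value 0.09 is just an illustration, not taken from either package):

# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) when lambda == 0
bc <- function(x, lambda) if (lambda == 0) log(x) else (x^lambda - 1) / lambda

x <- as.data.frame(state.x77)$Area
head(bc(x, 0.09))           # hypothetical exact lambda
head(bc(x, round(0.09, 1))) # the same lambda rounded to one decimal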

Check this example:

library(caret) 
library(recipes) 
library(MASS)
library(mlbench)
data(Sonar)

df <- Sonar[, -61] # drop the Class factor, keeping the 60 numeric predictors

Using the preProcess function and setting fudge to 0 (no tolerance for the 0/1 coercion of lambdas):

z2 <- preProcess(x = as.data.frame(df), method = c('BoxCox'), fudge = 0)

and using recipes:

z <- recipe(~ ., data = as.data.frame(df )) %>% 
  step_BoxCox(., everything()) %>% 
  prep(., training = as.data.frame(df))

Let's check the lambdas from recipes:

z$steps[[1]]$lambdas
#output
        V1         V2         V3         V4         V5         V6         V7         V8         V9        V10        V11        V12 
0.09296796 0.23383117 0.19487939 0.11471259 0.18688851 0.35852835 0.48787887 0.36830343 0.26340880 0.29810673 0.33913896 0.50361765 
       V13        V14        V15        V16        V17        V18        V19        V20        V21        V22        V23        V24 
0.49178396 0.35997958 0.43900093 0.28981749 0.22843441 0.27016373 0.50573719 0.83436868 1.02366629 1.15194335 1.35062142 1.44484148 
       V25        V26        V27        V28        V29        V30        V31        V32        V33        V34        V35        V36 
1.51851127 1.61365888 1.47445453 1.44448827 1.22132457 1.00145613 0.66343491 0.61951328 0.53028496 0.45278118 0.39019507 0.37536033 
       V37        V38        V39        V40        V41        V42        V52        V53        V54        V55        V56        V57 
0.28428050 0.23439217 0.29554367 0.47263000 0.34455069 0.44036919 0.15240917 0.30314637 0.28647186 0.16202628 0.27153385 0.17005357 
       V58        V59        V60 
0.15688906 0.28761156 0.06652761 

and the lambdas for preProcess:

sapply(z2$bc, function(x) x$lambda)
#output
 V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 
0.1 0.2 0.2 0.1 0.2 0.4 0.5 0.4 0.3 0.3 0.3 0.5 0.5 0.4 0.4 0.3 0.2 0.3 0.5 0.8 1.0 1.2 1.4 1.4 1.5 1.6 1.5 1.4 1.2 1.0 0.7 0.6 0.5 0.5 
V35 V36 V37 V38 V39 V40 V41 V42 V52 V53 V54 V55 V56 V57 V58 V59 V60 
0.4 0.4 0.3 0.2 0.3 0.5 0.3 0.4 0.2 0.3 0.3 0.2 0.3 0.2 0.2 0.3 0.1 
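
As a quick check (reusing the z and z2 objects from above), rounding the recipes lambdas to one decimal place should reproduce the preProcess lambdas, if one-decimal rounding is indeed what preProcess does:

all.equal(round(z$steps[[1]]$lambdas, 1),
          sapply(z2$bc, function(x) x$lambda))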

So:

df$V1^z$steps[[1]]$lambdas[1]

is not equal to

df$V1^sapply(z2$bc, function(x) x$lambda)[1]
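
To make this concrete, here is a rough sketch (again reusing z and z2 from above) that applies the full Box-Cox formula with each package's lambda for V1:

# Box-Cox transform with each package's estimated lambda for V1
bc <- function(x, lambda) if (lambda == 0) log(x) else (x^lambda - 1) / lambda
head(bc(df$V1, z$steps[[1]]$lambdas[["V1"]])) # recipes lambda
head(bc(df$V1, z2$bc$V1$lambda))              # preProcess (rounded) lambda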

With the default fudge = 0.2 the difference will be even bigger, since lambdas in the range -0.2 to 0.2 will be changed to 0, i.e. a log transformation, while lambdas in the range 0.8 to 1.2 will be changed to 1, i.e. no transformation.
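
As an illustration of that rule (a sketch of the behaviour described above, not caret's actual internals):

# coerce a lambda to 0 or 1 when it lies within fudge of those values
apply_fudge <- function(lambda, fudge = 0.2) {
  if (abs(lambda) <= fudge) 0            # close to 0 -> log transform
  else if (abs(lambda - 1) <= fudge) 1   # close to 1 -> no transform
  else lambda
}
sapply(c(-0.15, 0.3, 0.9, 1.4), apply_fudge)
# 0.0 0.3 1.0 1.4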

I would not be concerned about these differences; both functions will reduce the skewness of the data. Just don't mix them in the same training pipeline.

Also, to get less biased estimates of performance, these transformations should be performed during re-sampling and not prior to it, to avoid data leakage.
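
One way to do that is to hand the unprepped recipe straight to caret::train, so the lambdas are re-estimated inside every resample. A sketch, assuming the Sonar data from above (recent caret versions, including the 6.0-79 used in the question, accept a recipe as the first argument to train):

rec <- recipe(Class ~ ., data = Sonar) %>% 
  step_BoxCox(all_predictors())

set.seed(42)
fit <- train(rec, data = Sonar, method = "glm",
             trControl = trainControl(method = "cv", number = 5))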

missuse
  • What is very interesting about the example is that it's not restricted to Box-Cox transformations; I am finding this for almost every other `recipe` vs `preProcess` comparison. I wonder what general rule is applied in the `caret` package - it would be quite frustrating if there is an arbitrary rounding rule, because where do you start to look? In my actual workflow I use 8 different transformations, and in terms of performance, using `preProcess` results in much better test ROCs than the `recipe` transformations – Hanjo Odendaal May 15 '18 at 07:59
  • It looks to me that it is: `round(x, digits = 1)`. How much better is the test ROC? It might be just a random improvement; if you were to use another test set, perhaps the `recipes` lambdas would perform better? – missuse May 15 '18 at 08:24
  • With `preProcess`'s methods `BoxCox`, `YeoJohnson` and `spatialSign` and no data transformations, I get ROCs ranging 0.88 - 0.92 (a mean improvement of 0.03 for the transformations), whereas with `recipes` all transformed datasets are ~0.55 and the original gives me ~0.88. Obviously this is my use case (`segmentationData` from caret). It just seems too far off – Hanjo Odendaal May 15 '18 at 08:35
  • There must be something else going on. If you wish, post another question describing the problem in a reproducible way and I will look into it a bit later. – missuse May 15 '18 at 08:38
  • Let me comb through the million lines of code to see. If I can reproduce the dissimilarity in results, then I'll post a link to a new question. Otherwise, we can consider this question closed – Hanjo Odendaal May 15 '18 at 09:46