I have been exploring the new recipes package for variable transformations as part of a machine learning pipeline. I opted for this approach, upgrading from caret's preProcess function, because of the new extensions recipes offers. But I am finding that the two packages give very different results for the transformed data:
library(caret) # V6.0-79
library(recipes) # V0.1.2
library(MASS) # V7.3-47
# transform variables using recipes
rec_box <- recipe(~ ., data = as.data.frame(state.x77)) %>%
  step_BoxCox(everything()) %>%
  prep(training = as.data.frame(state.x77)) %>%
  bake(newdata = as.data.frame(state.x77))
> head(rec_box)
# A tibble: 6 x 8
Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.19 138. 0.647 60171653. 6.89 651. 20. 56.0
2 5.90 185. 0.376 61218586. 5.52 1632. 152. 106.
3 7.70 155. 0.527 66409311. 4.08 1253. 15. 69.4
4 7.65 133. 0.570 66885876. 5.05 609. 65. 56.4
5 9.96 165. 0.0936 71570875. 5.13 1445. 20. 75.5
6 7.84 161. -0.382 73188251. 3.62 1503. 166. 67.7
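As a sanity check on my own understanding: both packages should just be applying the standard Box-Cox formula, (x^lambda - 1) / lambda, with a per-column lambda estimated from the training data. Here is a minimal sketch of recomputing one column by hand; note that tidy() returning the estimated lambdas in a `value` column is my assumption about this recipes version, so check it against yours:

# recompute Income from the Box-Cox formula using the recipe's lambda
prepped <- recipe(~ ., data = as.data.frame(state.x77)) %>%
  step_BoxCox(everything()) %>%
  prep(training = as.data.frame(state.x77))
lam <- tidy(prepped, number = 1)               # assumed: `terms` + `value` (lambda) columns
lam_income <- lam$value[lam$terms == "Income"]
x <- as.data.frame(state.x77)$Income
head((x^lam_income - 1) / lam_income)          # should match rec_box$Income above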
# transform variables using preProcess
pre_box <- preProcess(x = as.data.frame(state.x77), method = 'BoxCox') %>%
  predict(newdata = as.data.frame(state.x77))
> head(pre_box)
# A tibble: 6 x 8
Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8.19 118. 0.642 2383. 6.83 618. 20. 38.7
2 5.90 157. 0.374 2401. 5.47 1538. 152. 65.7
3 7.70 133. 0.524 2488. 4.05 1183. 15. 46.3
4 7.65 114. 0.566 2496. 5.01 579. 65. 38.9
5 9.96 141. 0.0935 2571. 5.09 1363. 20. 49.7
6 7.84 138. -0.383 2596. 3.60 1418. 166. 45.4
## Subtract recipes transformations from MASS::boxcox via caret::preProcess
> colMeans(rec_box - pre_box)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
0.000000e+00 2.215800e+01 2.515464e-03 6.803437e+07 2.638715e-02 5.883549e+01 0.000000e+00 1.745788e+01
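To narrow down where the divergence comes from, it also seems worth putting the estimated lambdas side by side, since any difference in lambda will propagate to every transformed value. A sketch, assuming preProcess stores its per-column BoxCoxTrans objects in the $bc element and reusing the tidy() assumption from above; columns skipped by one package show up as NA:

# side-by-side lambdas: caret vs recipes (both accessors are assumptions, see above)
pp <- preProcess(x = as.data.frame(state.x77), method = 'BoxCox')
caret_lambda <- sapply(pp$bc, function(z) z$lambda)

rec_lam <- tidy(
  recipe(~ ., data = as.data.frame(state.x77)) %>%
    step_BoxCox(everything()) %>%
    prep(training = as.data.frame(state.x77)),
  number = 1
)
recipes_lambda <- setNames(rec_lam$value, rec_lam$terms)
cbind(caret = caret_lambda[names(recipes_lambda)], recipes = recipes_lambda)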
So it would seem that they agree on some columns but differ substantially on others. Is there a reason why these transformations are so different? Has anyone else run into similar discrepancies?