Assume I have the data set below. Please let me know if this is a duplicate, but I am confused by this behaviour.

library(tidymodels)

mt <- mtcars[,c('mpg', 'hp', 'drat', 'am')]

mt$hp <- as.character(mt$hp)
mt$drat <- as.character(mt$drat)

dp_pipe1 <- recipe(mpg ~ hp + drat + am, data = mt) %>%
  update_role(c(hp, drat), new_role = "to_numeric") %>%
  step_mutate_at(has_role("to_numeric"), fn = as.numeric)

dp_pipe2 <- prep(dp_pipe1)
bake(dp_pipe2, new_data = NULL)

If you run the final bake() step, you will see that the values of drat have changed: in the original data they were 3.9, 3.9, 3.85, etc., but now they come out as 16, 16, 15, etc. Note that I am forcing a character conversion on the mtcars data only to show that my processing does a character-to-numeric conversion.

I am sorry if I have misread the documentation, but I am unable to understand this. Please help.

Note that my data has no factors:

EDIT 2:

> glimpse(mt)
Rows: 32
Columns: 4
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3…
$ hp   <chr> "110", "110", "93", "110", "175", "105",…
$ drat <chr> "3.9", "3.9", "3.85", "3.08", "3.15", "2…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

If I run this:

dp_pipe1 <- recipe(mpg ~ hp + drat + am, data = mt) %>%
  update_role(c(hp, drat), new_role = "to_numeric") %>%
  step_mutate_at(has_role("to_numeric"), fn = function(x) as.numeric(as.character(x)))

dp_pipe2 <- prep(dp_pipe1)
bake(dp_pipe2, new_data = NULL)

the code gives the right result.

EDIT 1:

I am not sure whether this is a bug, but if we use fn = function(x) as.numeric(as.character(x)) in step_mutate_at, it works fine.

PKumar
  • What is happening is probably [this](https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-integer-numeric-without-loss-of-information) and therefore this is a probable duplicate. – Rui Barradas Jul 18 '23 at 11:18
  • @RuiBarradas, no, my data is not in factors. It is all characters. – PKumar Jul 18 '23 at 11:19
  • But in regression, there is no difference between a character variable and a factor. I haven't investigated, but I suspect that, under the hood, `recipe(mpg ~ hp + drat + am,data=mt)` is converting from character to factor. Then, when you undo your conversion, `has_role('to_numeric'), fn= as.numeric` works on the (new) factor levels not the (old) factor labels. I believe OP's Edit 1 supports this hypothesis. – Limey Jul 18 '23 at 11:42
  • Yes, what @Limey is saying is what I think is happening. If needed, R's modeling functions coerce to factor automatically, `lm`, `glm`, mixed models (in packages lmer, lme4) and many others. – Rui Barradas Jul 18 '23 at 11:46
  • The thing is that I am not running any regression here. All I am using is a formula object wrapped inside recipe. Does this mean that a recipe object behaves like lm or glm and converts characters to factors internally? I am still going through the recipe documentation; as soon as I find anything I will update the question with the resolution. – PKumar Jul 18 '23 at 15:09

1 Answer

For 99% of modeling situations, factor encodings are better than character encodings for qualitative data. For that reason, recipes will convert characters to factors. To avoid this, set the prep() option strings_as_factors to FALSE.

What you are getting for drat is the integer that is the factor level index.

Here's an example:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

drat_0 <- mtcars$drat
drat_1 <- as.character(drat_0)
drat_2 <- factor(drat_1)
drat_3 <- as.numeric(drat_2)

tibble(drat_0, drat_1, drat_2, drat_3) %>% str()
#> tibble [32 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ drat_0: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ drat_1: chr [1:32] "3.9" "3.9" "3.85" "3.08" ...
#>  $ drat_2: Factor w/ 22 levels "2.76","2.93",..: 16 16 15 5 6 1 7 11 17 17 ...
#>  $ drat_3: num [1:32] 16 16 15 5 6 1 7 11 17 17 ...

Created on 2023-07-18 with reprex v2.0.2
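
As a minimal sketch of the strings_as_factors option mentioned above, applied to the recipe from the question: the object names rec, prepped and baked are only illustrative, and this assumes a recipes version where prep() still accepts strings_as_factors (newer releases may prefer setting it elsewhere, so check the documentation for your installed version).

```r
library(tidymodels)

# Same setup as in the question: numeric columns stored as character
mt <- mtcars[, c("mpg", "hp", "drat", "am")]
mt$hp <- as.character(mt$hp)
mt$drat <- as.character(mt$drat)

rec <- recipe(mpg ~ hp + drat + am, data = mt) %>%
  update_role(hp, drat, new_role = "to_numeric") %>%
  step_mutate_at(has_role("to_numeric"), fn = as.numeric)

# Keep character columns as character while prepping, so that as.numeric()
# parses the original strings instead of factor level codes
prepped <- prep(rec, strings_as_factors = FALSE)
baked <- bake(prepped, new_data = NULL)

# Compare the baked columns with the original numeric values
all.equal(baked$drat, mtcars$drat)
all.equal(baked$hp, mtcars$hp)
```

The as.numeric(as.character(x)) workaround from the question works for a related reason: it first turns the factor back into its labels and only then parses them as numbers.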

topepo
  • Thanks, strings_as_factors = FALSE solves the problem. You could highlight this in your answer; that would probably clear up doubts for anyone who sees this in the future. Thanks again! – PKumar Jul 19 '23 at 03:56