I want to one-hot encode in R using the tidyverse, without relying on packages such as caret or mltools.

## Load vcd package
library(vcd)

## Load Arthritis dataset (data frame)
data(Arthritis)


Arthritis[1:5, ][2:5]

  Treatment  Sex Age Improved
1   Treated Male  27     Some
2   Treated Male  29     None
3   Treated Male  30     None
4   Treated Male  32   Marked
5   Treated Male  46   Marked

Is there an easy way to do this in tidyverse where I keep n-1 of the values for each categorical column? For example Sex is binary in this dataset so I would only need a one-hot encoded column for either Male or Female. The age feature would be ignored.

Eisen
  • I don't fully understand what you mean by *"keep n-1 of the values"*. Would you mind being explicit and showing your expected output for whatever code we can come up with? Thanks! – r2evans Aug 31 '21 at 16:14
  • For example, Sex has two values (Male and Female) I would only want to encode it as 1 column (n - 1 = 2 -1 = 1) and it could be named Sex_Male for example. – Eisen Aug 31 '21 at 16:17
  • Are you talking about adding dummy variables? – r2evans Aug 31 '21 at 16:18
  • Yes essentially dummy variables, sorry for the confusion – Eisen Aug 31 '21 at 16:18
  • I marked this as a dupe (in order to keep the first other-question percolating to the top), but it's a still good question. Feel free to accept one of the answers. – r2evans Aug 31 '21 at 17:38

4 Answers


For your specific example you could do this:

library(dplyr)

Arthritis |> 
  as_tibble() |> # not necessary, just using it for output readability
  mutate(sex_male = as.numeric(Sex) - 1)
#> # A tibble: 84 × 6
#>       ID Treatment Sex     Age Improved sex_male
#>    <int> <fct>     <fct> <int> <ord>       <dbl>
#>  1    57 Treated   Male     27 Some            1
#>  2    46 Treated   Male     29 None            1
#>  3    77 Treated   Male     30 None            1
#>  4    17 Treated   Male     32 Marked          1
#>  5    36 Treated   Male     46 Marked          1
#>  6    23 Treated   Male     58 Marked          1
#>  7    75 Treated   Male     59 None            1
#>  8    39 Treated   Male     59 Marked          1
#>  9    33 Treated   Male     63 None            1
#> 10    55 Treated   Male     63 None            1
#> # … with 74 more rows

This only works because Sex is a factor variable with two levels/distinct values. More complex variables will need more attention, unless you are willing to use a function from a package.
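For factors with more than two levels, the same idea can be extended by comparing the column against each non-reference level. The following is only a sketch under assumptions: `toy` is made-up data standing in for a slice of Arthritis, and `dummy_cols()` is a hypothetical helper, not a function from dplyr or purrr.

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical; stands in for a few rows of Arthritis)
toy <- tibble(
  Sex      = factor(c("Male", "Female", "Male", "Female")),
  Improved = factor(c("Some", "None", "Marked", "None"),
                    levels = c("None", "Some", "Marked"))
)

# Hypothetical helper: one 0/1 column per level, skipping the first
# (reference) level so that a factor with n levels yields n - 1 columns
dummy_cols <- function(data, col) {
  lvls <- levels(data[[col]])[-1]
  lvls %>%
    set_names(paste(col, lvls, sep = "_")) %>%
    map(~ as.integer(data[[col]] == .x)) %>%
    as_tibble()
}

encoded <- bind_cols(toy, dummy_cols(toy, "Sex"), dummy_cols(toy, "Improved"))
encoded
```

Here `Sex` contributes a single `Sex_Male` column and `Improved` contributes `Improved_Some` and `Improved_Marked`, with `None` acting as the reference level.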

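You asked for a tidyverse solution. The recipes package is part of tidymodels, which follows the same design principles, and its step_dummy() creates n-1 dummy columns per factor by default.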

library(recipes)

Arthritis |> 
  recipe(Improved ~ .) |> 
  step_dummy(Sex, Treatment) |> 
  prep() |> 
  bake(Arthritis)
#> # A tibble: 84 × 5
#>       ID   Age Improved Sex_Male Treatment_Treated
#>    <int> <int> <ord>       <dbl>             <dbl>
#>  1    57    27 Some            1                 1
#>  2    46    29 None            1                 1
#>  3    77    30 None            1                 1
#>  4    17    32 Marked          1                 1
#>  5    36    46 Marked          1                 1
#>  6    23    58 Marked          1                 1
#>  7    75    59 None            1                 1
#>  8    39    59 Marked          1                 1
#>  9    33    63 None            1                 1
#> 10    55    63 None            1                 1
#> # … with 74 more rows
Till

You could use a combination of pivot_longer and pivot_wider for this.

library(dplyr)
library(tidyr)

Arthritis %>%
  as_tibble() %>% # not necessary, for better viewing
  mutate(across(everything(), as.character)) %>% 
  pivot_longer(c(Sex, Treatment, Improved), names_to = 'variable', values_to = 'value') %>% # specify the columns to encode here
  mutate(ind = 1) %>%
  unite(col_name, variable, value) %>%
  pivot_wider(values_from = ind, names_from = col_name, values_fill = 0)

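For the n-1 encoding, once the data is in long format you can filter out one value per variable. Note that the filter also drops every row whose categorical values are all filtered out (here Female, Placebo, Marked), which is why the result below has 78 rows rather than the original 84.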

long_format <- Arthritis %>%
  as_tibble() %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(c(Sex, Treatment, Improved), names_to = 'variable', values_to = 'value') %>%
  mutate(ind = 1)

# for the n-1
values_to_keep <- long_format %>%
  count(variable, value) %>%
  group_by(variable) %>%
  slice(-1) %>%
  pull(value)

long_format %>%
  filter(value %in% values_to_keep) %>%
  unite(col_name, variable, value) %>%
  pivot_wider(values_from = ind, names_from = col_name, values_fill = 0)

# A tibble: 78 x 6
   ID    Age   Sex_Male Treatment_Treated Improved_Some Improved_None
   <chr> <chr>    <dbl>             <dbl>         <dbl>         <dbl>
 1 57    27           1                 1             1             0
 2 46    29           1                 1             0             1
 3 77    30           1                 1             0             1
 4 17    32           1                 1             0             0
 5 36    46           1                 1             0             0
 6 23    58           1                 1             0             0
 7 75    59           1                 1             0             1
 8 39    59           1                 1             0             0
 9 33    63           1                 1             0             1
10 55    63           1                 1             0             1
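If every row should survive, one option is to pivot all values wide first and drop the reference columns afterwards instead of filtering in long format. This is a minimal sketch on made-up data (`toy` and `reference_cols` are illustrative names, not part of the answer above), assuming the alphabetically first value of each variable is the reference level.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical); row 3 holds only reference levels
toy <- tibble(
  ID        = 1:4,
  Sex       = c("Male", "Female", "Female", "Male"),
  Treatment = c("Treated", "Treated", "Placebo", "Placebo")
)

long_format <- toy %>%
  pivot_longer(c(Sex, Treatment), names_to = "variable", values_to = "value") %>%
  unite(col_name, variable, value, remove = FALSE)

# The alphabetically first value of each variable is treated as the reference
reference_cols <- long_format %>%
  distinct(variable, value, col_name) %>%
  arrange(variable, value) %>%
  group_by(variable) %>%
  slice(1) %>%
  pull(col_name)

# Encode every value, then drop the reference columns so no rows are lost
encoded <- long_format %>%
  mutate(ind = 1) %>%
  select(ID, col_name, ind) %>%
  pivot_wider(names_from = col_name, values_from = ind, values_fill = 0) %>%
  select(-all_of(reference_cols))
encoded
```

Row 3 (Female, Placebo) stays in the result as a row of zeros, which is what a reference-level observation should look like in an n-1 encoding.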
Jeff Parker

I agree with Till that recipes is the way to go here. But if you want a solution strictly from the tidyverse, you could do something like this:

library(vcd)
library(tidyverse)


Arthritis %>%
  as_tibble() %>%
  mutate(d = map_dfc(unique(Improved) %>%
                        set_names(.),
                      ~ Improved == .x
                      ) %>% 
           .[-1]
         )
#> # A tibble: 84 × 6
#>       ID Treatment Sex     Age Improved d$None $Marked
#>    <int> <fct>     <fct> <int> <ord>    <lgl>  <lgl>  
#>  1    57 Treated   Male     27 Some     FALSE  FALSE  
#>  2    46 Treated   Male     29 None     TRUE   FALSE  
#>  3    77 Treated   Male     30 None     TRUE   FALSE  
#>  4    17 Treated   Male     32 Marked   FALSE  TRUE   
#>  5    36 Treated   Male     46 Marked   FALSE  TRUE   
#>  6    23 Treated   Male     58 Marked   FALSE  TRUE   
#>  7    75 Treated   Male     59 None     TRUE   FALSE  
#>  8    39 Treated   Male     59 Marked   FALSE  TRUE   
#>  9    33 Treated   Male     63 None     TRUE   FALSE  
#> 10    55 Treated   Male     63 None     TRUE   FALSE  
#> # … with 74 more rows
shs

You may be able to use just model.matrix for this. I've altered your sample data a little to ensure there are 2 or more levels for all:

dat <- structure(list(Treatment = c("Treated", "Treated", "UnTreated", "Treated", "Treated"), Sex = c("Male", "Male", "Male", "FeMale", "Male"), Age = c(27L, 29L, 30L, 32L, 46L), Improved = c("Some", "None", "None", "Marked", "Marked")), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
dat
#   Treatment    Sex Age Improved
# 1   Treated   Male  27     Some
# 2   Treated   Male  29     None
# 3 UnTreated   Male  30     None
# 4   Treated FeMale  32   Marked
# 5   Treated   Male  46   Marked

From there, build a formula from the categorical columns and pass it to model.matrix:

isnum <- sapply(dat, is.numeric)
iscat <- !isnum & lengths(lapply(dat, unique)) > 1
paste("~ 0 +", paste(names(dat)[iscat], collapse = " + "))
# [1] "~ 0 + Treatment + Sex + Improved"
cbind(dat[, !iscat, drop=FALSE],
      model.matrix(formula(paste("~ 0 +", paste(names(dat)[iscat], collapse = " + "))), data = dat))
#   Age TreatmentTreated TreatmentUnTreated SexMale ImprovedNone ImprovedSome
# 1  27                1                  0       1            0            1
# 2  29                1                  0       1            1            0
# 3  30                0                  1       1            1            0
# 4  32                1                  0       0            0            0
# 5  46                1                  0       1            0            0
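Because `~ 0 +` removes the intercept, the first factor in the formula keeps all of its levels (both TreatmentTreated and TreatmentUnTreated appear above). If n-1 columns are wanted for every variable, one possible variant is to keep the intercept in the formula and drop the `(Intercept)` column afterwards. This is a sketch in base R; `cat_cols` and `encoded` are illustrative names.

```r
# Same sample data as above, written with data.frame() for readability
dat <- data.frame(
  Treatment = c("Treated", "Treated", "UnTreated", "Treated", "Treated"),
  Sex       = c("Male", "Male", "Male", "FeMale", "Male"),
  Age       = c(27L, 29L, 30L, 32L, 46L),
  Improved  = c("Some", "None", "None", "Marked", "Marked")
)

cat_cols <- c("Treatment", "Sex", "Improved")

# With an intercept, each factor is coded against its first level (n - 1 columns)
mm <- model.matrix(reformulate(cat_cols), data = dat)

# Drop the "(Intercept)" column and re-attach the untouched numeric column
encoded <- cbind(dat["Age"], mm[, -1, drop = FALSE])
encoded
```

The reference levels here are simply the alphabetically first values (Treated, FeMale, Marked), since model.matrix converts character columns to factors with sorted levels.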
r2evans