library(tidyverse)
df <- tibble(
id = paste0("Person", 1:3),
class_label = c(TRUE, FALSE, FALSE),
attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K")
)
df
#> # A tibble: 3 x 3
#> id class_label attribute
#> <chr> <lgl> <chr>
#> 1 Person1 TRUE A469/T593/K022K
#> 2 Person2 FALSE A835/Z935/T593
#> 3 Person3 FALSE U835F/W5326/K022K
df %>%
separate_rows(attribute, sep = "/") %>%
mutate(i = 1) %>%
spread(attribute, i, fill = 0)
#> # A tibble: 3 x 9
#> id class_label A469 A835 K022K T593 U835F W5326 Z935
#> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Person1 TRUE 1 0 1 1 0 0 0
#> 2 Person2 FALSE 0 1 0 1 0 0 1
#> 3 Person3 FALSE 0 0 1 0 1 1 0
Taking note that what you are trying to do is frequently referred to as one-hot encoding
or binary encoding
. In addition, you may want to note that df %>% separate_rows(attribute, sep = "/")
your data is in a tidy format which could have other benefits.
UPDATE:
To extend to more columns, you might want to first determine which attributes you will be encoding. So something like, select(df, contains("attribute"))
or select(df, 3:4)
.
df <- tibble(
id = paste0("Person", 1:3),
class_label = c(TRUE, FALSE, FALSE),
attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K"),
attribute2 = c("one/two/three", "four/five/six", "one/five/six")
)
df
#> # A tibble: 3 x 4
#> id class_label attribute attribute2
#> <chr> <lgl> <chr> <chr>
#> 1 Person1 TRUE A469/T593/K022K one/two/three
#> 2 Person2 FALSE A835/Z935/T593 four/five/six
#> 3 Person3 FALSE U835F/W5326/K022K one/five/six
one_hot <- function(data, att) {
quo_att <- enquo(att)
data %>%
select(id, class_label, !! quo_att) %>%
separate_rows(!! quo_att, sep = "/") %>%
mutate(i = 1) %>%
spread(!! quo_att, i, fill = 0) %>%
select(-id, -class_label)
}
attributes_to_map <- select(df, contains("attribute")) %>% names
attributes_to_map
#> [1] "attribute" "attribute2"
attributes_to_map %>%
map_dfc(~ one_hot(df, .)) %>%
bind_cols(select(df, id, class_label)) %>%
select(id, class_label, everything())
#> # A tibble: 3 x 15
#> id class_label A469 A835 K022K T593 U835F W5326 Z935 five four
#> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Pers~ TRUE 1 0 1 1 0 0 0 0 0
#> 2 Pers~ FALSE 0 1 0 1 0 0 1 1 1
#> 3 Pers~ FALSE 0 0 1 0 1 1 0 1 0
#> # ... with 4 more variables: one <dbl>, six <dbl>, three <dbl>, two <dbl>
But at this point, you may also want to consider the recipes package or do searches on one-hot encoding for several variables.