"Revert" one hot encoding

Question

I have a dataset where many variables are effectively one-hot encoded, and I would like to collapse each group of them into a single variable that holds the value.

  name  born_2017 born_2018 born_2019
  <chr>     <dbl>     <dbl>     <dbl>
1 Paul          0         1         0
2 Diane         0         0         1
3 Jose          1         0         0

And I want it to look like this:

  name  birth_year
  <chr> <chr>     
1 Paul  born_2018 
2 Diane born_2019 
3 Jose  born_2017

I looked around dplyr and tidyr, but somehow I didn't find what I need.

PS: I have to do this for a lot of variables, so an easily generalizable solution, or one that works with the pipe, would be very helpful.

@NelsonGon That's not the same problem: I don't want to end up with more rows. – Haezer, Apr 10 '19 at 08:30
As it's currently written, it's the same problem. What happens to the 0s and 1s?! — NelsonGon, Apr 10 '19 at 08:33

score 1 · Accepted Answer · answered Apr 10 '19 at 08:25

You can use gather

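library(dplyr)
library(tidyr)  # gather() comes from tidyr, not dplyr

df %>%
  # stack all three one-hot columns into key/value pairs
  gather(birth_year, flag, born_2017:born_2019) %>%
  filter(flag == 1) %>%  # keep only the column that was set
  select(-flag)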
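Since the question asks for something that generalizes to many variables, here is a minimal sketch of the same reshape-then-filter idea using `pivot_longer()`, which superseded `gather()` in tidyr 1.0.0. It assumes the one-hot columns share a common prefix (`born_` here) and that each row has exactly one 1 among them; the data frame `df` is rebuilt from the question's example.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question
df <- data.frame(
  name      = c("Paul", "Diane", "Jose"),
  born_2017 = c(0, 0, 1),
  born_2018 = c(1, 0, 0),
  born_2019 = c(0, 1, 0)
)

result <- df %>%
  # select the one-hot columns by prefix instead of listing them
  pivot_longer(starts_with("born_"),
               names_to = "birth_year", values_to = "flag") %>%
  filter(flag == 1) %>%   # keep the column that was set in each row
  select(-flag)

print(result)
```

Selecting with `starts_with()` means the same three pipe steps can be reused for another group of columns by changing only the prefix and the new column name.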

answered Apr 10 '19 at 08:25

Sonny

I thought about that, but having to use a flag column adds many lines of code, since I have to do it for many variables, which is quite tedious. – Haezer Apr 10 '19 at 08:31

Lennyy · Answer 2 · 2019-04-10T08:39:04.647

score 0

example <- read.table(text = "
name  born_2017 born_2018 born_2019
Paul          0         1         0
Diane         0         0         1
Jose          1         0         0", header = TRUE)

In this particular example, base R alone works just as well:

example$birth_year <- colnames(example[,2:4])[apply(example[,2:4], 1, which.max)]

example[,c("name", "birth_year")]
   name birth_year
1  Paul  born_2018
2 Diane  born_2019
3  Jose  born_2017

Based on Sotos's suggestions, the following two approaches are vectorized, do not need apply, and are more concise, so they are preferable:

subset(cbind(example[1], stack(example[-1])), values == 1)

or

names(example[-1])[max.col(example[-1])]
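To reuse the `max.col` idea for many groups of columns, it can be wrapped in a small function. The sketch below is only an illustration: `collapse_onehot` and its arguments are hypothetical names, not part of base R or of any answer here, and it assumes each group of one-hot columns shares a prefix. It passes `ties.method = "first"` because the default for `max.col` is `"random"`, and it sets rows that are not exactly one-hot to `NA` rather than returning an arbitrary label. It works on a fresh copy of the data, so the one-hot columns are the only ones matched.

```r
# Hypothetical helper: collapse the one-hot columns that start with `prefix`
# into a single character column named `new_col`.
collapse_onehot <- function(df, prefix, new_col) {
  cols <- names(df)[startsWith(names(df), prefix)]
  m <- as.matrix(df[cols])
  # index of the largest value per row; "first" keeps the result deterministic
  picked <- cols[max.col(m, ties.method = "first")]
  # rows with zero or several 1s are not valid one-hot rows: mark them NA
  picked[rowSums(m == 1) != 1] <- NA_character_
  df[[new_col]] <- picked
  # drop the original one-hot columns
  df[setdiff(names(df), cols)]
}

# Example data copied from the question
example <- data.frame(
  name      = c("Paul", "Diane", "Jose"),
  born_2017 = c(0, 0, 1),
  born_2018 = c(1, 0, 0),
  born_2019 = c(0, 1, 0)
)

result <- collapse_onehot(example, "born_", "birth_year")
print(result)
```

Because the data frame is the first argument, the helper also chains with the pipe, for example `example %>% collapse_onehot("born_", "birth_year")`, and can be called once per group of one-hot columns.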

edited Apr 10 '19 at 08:39

answered Apr 10 '19 at 08:27

Lennyy

There are vectorized ways to do it via base R that do not require `apply`. For example, `subset(cbind(df[1], stack(df[-1])), values == 1)` or following your thought, simply, `names(df[-1])[max.col(df[-1])]` – Sotos Apr 10 '19 at 08:32
That is indeed much better, thanks a lot :) I hope you don't mind that I updated my answer – Lennyy Apr 10 '19 at 08:37