0

I have a dataset where many variables are effectively "one-hot encoded", and I would like to collapse them into a single variable that holds the value.

  name  born_2017 born_2018 born_2019
  <chr>     <dbl>     <dbl>     <dbl>
1 Paul          0         1         0
2 Diane         0         0         1
3 Jose          1         0         0

And I want it to look like this:

  name  birth_year
  <chr> <chr>     
1 Paul  born_2018 
2 Diane born_2019 
3 Jose  born_2017

I looked through dplyr and tidyr but didn't find what I need.

PS: I have to do this for a lot of variables, so a solution that generalizes easily, or that works with the pipe, would be very helpful.

NelsonGon
Haezer
2 Answers

1

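You can use gather() from tidyr:

library(dplyr)
library(tidyr)  # gather() lives in tidyr, not dplyr
df %>%
  # stack all three indicator columns into birth_year / flag pairs
  gather(birth_year, flag, born_2017:born_2019) %>%
  # keep only the row where the indicator is set
  filter(flag == 1) %>%
  select(-flag)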
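
As a sketch of the same idea with the newer tidyr interface: pivot_longer() (tidyr 1.0.0 and later) supersedes gather(), and starts_with("born_") picks up every indicator column without listing them one by one. The data frame df below is rebuilt from the question's example so the snippet runs on its own; the name result is only for illustration.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question
df <- data.frame(
  name = c("Paul", "Diane", "Jose"),
  born_2017 = c(0, 0, 1),
  born_2018 = c(1, 0, 0),
  born_2019 = c(0, 1, 0)
)

result <- df %>%
  # one row per (name, indicator column) pair
  pivot_longer(starts_with("born_"), names_to = "birth_year", values_to = "flag") %>%
  # keep only the indicator that is set
  filter(flag == 1) %>%
  select(-flag)
```

The flag column is still needed as an intermediate, but the selection helper means the code does not grow with the number of indicator columns.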
Sonny
  • I thought about that, but having to use a flag adds many lines to the code, since I have to do it for many variables, which is quite tedious – Haezer Apr 10 '19 at 08:31
0
example <- read.table(text = "
name  born_2017 born_2018 born_2019
 Paul          0         1         0
 Diane         0         0         1
 Jose          1         0         0", header = TRUE)

For this particular example, base R alone works just as well:

example$birth_year <- colnames(example[,2:4])[apply(example[,2:4], 1, which.max)]

example[,c("name", "birth_year")]
   name birth_year
1  Paul  born_2018
2 Diane  born_2019
3  Jose  born_2017

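Based on Sotos's suggestions, the following two approaches are vectorized, do not need apply, and are more concise, so they are preferable. The first returns a long data frame filtered to the rows where the value is 1; the second returns the vector of matching column names: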

subset(cbind(example[1], stack(example[-1])), values == 1)

or

names(example[-1])[max.col(example[-1])]
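
Since the question asks for something that generalizes to many variables and works with the pipe, the max.col() approach can be wrapped in a small function. This is only a sketch: collapse_one_hot and its arguments prefix and new_col are hypothetical names, not part of any package, and it assumes each row has exactly one 1 among the matching columns (a row of all zeros would silently get the first column).

```r
# Hypothetical helper: collapse all columns whose names start with `prefix`
# into a single column named `new_col`, and drop the indicator columns.
collapse_one_hot <- function(data, prefix, new_col) {
  cols <- names(data)[startsWith(names(data), prefix)]
  # max.col() gives, per row, the index of the largest value;
  # ties.method = "first" keeps the result deterministic
  data[[new_col]] <- cols[max.col(data[cols], ties.method = "first")]
  data[setdiff(names(data), cols)]
}

# Example data copied from the question
example <- data.frame(
  name = c("Paul", "Diane", "Jose"),
  born_2017 = c(0, 0, 1),
  born_2018 = c(1, 0, 0),
  born_2019 = c(0, 1, 0)
)

collapsed <- collapse_one_hot(example, "born_", "birth_year")
```

Because the data frame is the first argument, the helper can be chained once per variable group, e.g. example %>% collapse_one_hot("born_", "birth_year").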
Lennyy
  • 2
    There are vectorized ways to do it via base R that do not require `apply`. For example, `subset(cbind(df[1], stack(df[-1])), values == 1)` or following your thought, simply, `names(df[-1])[max.col(df[-1])]` – Sotos Apr 10 '19 at 08:32
  • That is much better indeed, thanks a lot :) I hope you don't mind that I updated my answer – Lennyy Apr 10 '19 at 08:37