Convert from Wide to Long format for mutually exclusive categories

Question

I have a data.frame similar to:

> df<-data.frame(ID=1:5, Age30_40=c(1,0,0,1,0),Age40_50=c(0,1,1,0,0), Age50_60=c(0,0,0,0,1))
> df
  ID Age30_40 Age40_50 Age50_60
1  1        1        0        0
2  2        0        1        0
3  3        0        1        0
4  4        1        0        0
5  5        0        0        1

I want to convert this wide data to a long data such that my output is as follows:

ID  AgeRange
1   Age30_40
2   Age40_50
3   Age40_50
4   Age30_40
5   Age50_60

However, all that I could manage thus far is this:

> library(tidyr)
> gather(df, AgeRange, InThisRange, Age30_40:Age50_60)
   ID AgeRange InThisRange
1   1 Age30_40   1
2   2 Age30_40   0
3   3 Age30_40   0
4   4 Age30_40   1
5   5 Age30_40   0
6   1 Age40_50   0
7   2 Age40_50   1
8   3 Age40_50   1
9   4 Age40_50   0
10  5 Age40_50   0
11  1 Age50_60   0
12  2 Age50_60   0
13  3 Age50_60   0
14  4 Age50_60   0
15  5 Age50_60   1

Since this is a simplified example, I can loop through each entry and keep the only non-zero AgeRange entry corresponding to each ID by checking the boolean variable InThisRange. However, this is definitely not a very efficient solution for large data frames. Is there a simpler way to achieve this?

Related: [For each row return the column name of the largest value](https://stackoverflow.com/questions/17735859/for-each-row-return-the-column-name-of-the-largest-value) — Henrik, Apr 10 '18 at 18:03

score 1 · Answer 1 · answered Apr 10 '18 at 17:28

1

We can use max.col from base R (assuming that there is only a single 1 per row)

cbind(df[1], AgeRange = names(df)[-1][max.col(df[-1])])
#  ID AgeRange
#1  1 Age30_40
#2  2 Age40_50
#3  3 Age40_50
#4  4 Age30_40
#5  5 Age50_60

answered Apr 10 '18 at 17:28

akrun

874,273
37
540
662

Thanks for the quick answer. Just a couple of things. 1. My actual dataframe has many other columns. This operation is deleting all other columns. 2. The number of rows are higher after the operation compared to the number of rows before. I think this is due to the two errors cbind is throwing regarding non-identical attributes and introducing NAs by coercion. Any suggestions? – DotPi Apr 10 '18 at 18:09
@DotPi You can subset those columns you need and `cbind` with the 'AgeRange' The second part is not clear as the `max.col` will output one value for each row – akrun Apr 11 '18 at 02:54

score 0 · Answer 2 · answered Apr 10 '18 at 19:21

An apply based solution can be achieved using dynamic names of column.

#Columns to select to change in long format
cols <- grepl("^Age.*",names(df))

cbind(df[!cols], AgeRange = apply(df[cols],1,function(x)names(x[x!=0][1])))
#   ID AgeRange
# 1  1 Age30_40
# 2  2 Age40_50
# 3  3 Age40_50
# 4  4 Age30_40
# 5  5 Age50_60

Note: The above solution will consider only left-most column where multiple columns got 1 in same row.

score 0 · Answer 3 · answered Apr 10 '18 at 19:37

How about this?

library(dplyr)

#add column names which you don't want to convert to long data
ignore_cols <- c("ID", "another_col")

df %>%
  select(-one_of(ignore_cols)) %>%
  mutate(AgeRange = names(.)[max.col(.)]) %>%
  select(AgeRange) %>%
  bind_cols(df[,ignore_cols])

Output is:

  AgeRange ID another_col
1 Age30_40  1         101
2 Age40_50  2         102
3 Age40_50  3         103
4 Age30_40  4         104
5 Age50_60  5         105

Sample data:

df <- structure(list(ID = 1:5, Age30_40 = c(1, 0, 0, 1, 0), Age40_50 = c(0, 
1, 1, 0, 0), Age50_60 = c(0, 0, 0, 0, 1), another_col = 101:105), .Names = c("ID", 
"Age30_40", "Age40_50", "Age50_60", "another_col"), row.names = c(NA, 
-5L), class = "data.frame")

Note: For the illustration purpose I have added one more column (i.e. another_col).

Convert from Wide to Long format for mutually exclusive categories

3 Answers3