0

I have a data.frame similar to:

> df<-data.frame(ID=1:5, Age30_40=c(1,0,0,1,0),Age40_50=c(0,1,1,0,0), Age50_60=c(0,0,0,0,1))
> df
  ID Age30_40 Age40_50 Age50_60
1  1        1        0        0
2  2        0        1        0
3  3        0        1        0
4  4        1        0        0
5  5        0        0        1

I want to convert this wide data to a long data such that my output is as follows:

ID  AgeRange
1   Age30_40
2   Age40_50
3   Age40_50
4   Age30_40
5   Age50_60

However, all that I could manage thus far is this:

> library(tidyr)
> gather(df, AgeRange, InThisRange, Age30_40:Age50_60)
   ID AgeRange InThisRange
1   1 Age30_40   1
2   2 Age30_40   0
3   3 Age30_40   0
4   4 Age30_40   1
5   5 Age30_40   0
6   1 Age40_50   0
7   2 Age40_50   1
8   3 Age40_50   1
9   4 Age40_50   0
10  5 Age40_50   0
11  1 Age50_60   0
12  2 Age50_60   0
13  3 Age50_60   0
14  4 Age50_60   0
15  5 Age50_60   1

Since this is a simplified example, I can loop through each entry and keep the only non-zero AgeRange entry corresponding to each ID by checking the boolean variable InThisRange. However, this is definitely not a very efficient solution for large data frames. Is there a simpler way to achieve this?

DotPi
  • 3,977
  • 6
  • 33
  • 53
  • Related: [For each row return the column name of the largest value](https://stackoverflow.com/questions/17735859/for-each-row-return-the-column-name-of-the-largest-value) – Henrik Apr 10 '18 at 18:03

3 Answers3

1

We can use max.col from base R (assuming that there is only a single 1 per row)

cbind(df[1], AgeRange = names(df)[-1][max.col(df[-1])])
#  ID AgeRange
#1  1 Age30_40
#2  2 Age40_50
#3  3 Age40_50
#4  4 Age30_40
#5  5 Age50_60
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks for the quick answer. Just a couple of things. 1. My actual dataframe has many other columns. This operation is deleting all other columns. 2. The number of rows are higher after the operation compared to the number of rows before. I think this is due to the two errors cbind is throwing regarding non-identical attributes and introducing NAs by coercion. Any suggestions? – DotPi Apr 10 '18 at 18:09
  • @DotPi You can subset those columns you need and `cbind` with the 'AgeRange' The second part is not clear as the `max.col` will output one value for each row – akrun Apr 11 '18 at 02:54
0

An apply based solution can be achieved using dynamic names of column.

#Columns to select to change in long format
cols <- grepl("^Age.*",names(df))

cbind(df[!cols], AgeRange = apply(df[cols],1,function(x)names(x[x!=0][1])))
#   ID AgeRange
# 1  1 Age30_40
# 2  2 Age40_50
# 3  3 Age40_50
# 4  4 Age30_40
# 5  5 Age50_60

Note: The above solution will consider only left-most column where multiple columns got 1 in same row.

MKR
  • 19,739
  • 4
  • 23
  • 33
0

How about this?

library(dplyr)

#add column names which you don't want to convert to long data
ignore_cols <- c("ID", "another_col")

df %>%
  select(-one_of(ignore_cols)) %>%
  mutate(AgeRange = names(.)[max.col(.)]) %>%
  select(AgeRange) %>%
  bind_cols(df[,ignore_cols])

Output is:

  AgeRange ID another_col
1 Age30_40  1         101
2 Age40_50  2         102
3 Age40_50  3         103
4 Age30_40  4         104
5 Age50_60  5         105

Sample data:

df <- structure(list(ID = 1:5, Age30_40 = c(1, 0, 0, 1, 0), Age40_50 = c(0, 
1, 1, 0, 0), Age50_60 = c(0, 0, 0, 0, 1), another_col = 101:105), .Names = c("ID", 
"Age30_40", "Age40_50", "Age50_60", "another_col"), row.names = c(NA, 
-5L), class = "data.frame")

Note: For the illustration purpose I have added one more column (i.e. another_col).

Prem
  • 11,775
  • 1
  • 19
  • 33