I want to collapse several 0/1 indicator columns into a single column that holds the name of the column containing the 1.

I have a matrix like this:

| X  | Y1 | Y2 | Y3 | Y4 |
|----|----|----|----|----|
|id1 | 0  | 0  | 1  | 0  |
|id2 | 0  | 1  | 0  | 0  |
|id3 | 0  | 1  | 0  | 0  |
|id4 | 0  | 0  | 0  | 1  |
|id5 | 0  | 1  | 0  | 0  |
|id6 | 1  | 0  | 0  | 0  |

And I want to get a matrix like this:

| X  | Y  |
|----|----|
|id1 | Y3 |
|id2 | Y2 |
|id3 | Y2 |
|id4 | Y4 |
|id5 | Y2 |
|id6 | Y1 |

I don't know how to do this.

Thank you for your help.

Sotos
C. Bd
Have a look at tidyr's 'gather' function; there are many examples on SO. https://www.rdocumentation.org/packages/tidyr/versions/0.8.0/topics/gather DF %>% gather(key = X, value = Y, -X) – Russ Hyde Feb 17 '18 at 09:37

3 Answers

Here is a vectorized method using max.col from base R:

names(df)[max.col(df[-1])+1]
#[1] "Y3" "Y2" "Y2" "Y4" "Y2" "Y1"

To construct your output, you can simply use data.frame:

data.frame(X = df$X, Y = names(df)[max.col(df[-1])+1])

which gives:

    X  Y
1 id1 Y3
2 id2 Y2
3 id3 Y2
4 id4 Y4
5 id5 Y2
6 id6 Y1

A slightly simpler version is possible, since max.col also works on logical matrices (courtesy of @Jaap):

names(df)[max.col(df == 1)]
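
One thing worth knowing: max.col breaks ties at random by default (ties.method = "random"), so a row with more than one maximum can give a different column on each run. Below is a minimal self-contained sketch that pins this down with ties.method = "first". The numeric data frame is an assumption on my part, rebuilt from the table in the question, and the names idx and result are only illustrative.

```r
# Assumed numeric version of the question's data (rebuilt from the table)
df <- data.frame(X  = paste0("id", 1:6),
                 Y1 = c(0, 0, 0, 0, 0, 1),
                 Y2 = c(0, 1, 1, 0, 1, 0),
                 Y3 = c(1, 0, 0, 0, 0, 0),
                 Y4 = c(0, 0, 0, 1, 0, 0))

# Position of the row maximum among the Y columns only;
# "first" makes the choice deterministic if a row has ties
idx <- max.col(as.matrix(df[-1]), ties.method = "first")

# Map positions back to column names (X is dropped, so no +1 offset is needed)
result <- data.frame(X = df$X, Y = names(df)[-1][idx])
result
```

With exactly one 1 per row, as in the question, there are no ties and this should match the one-liners above.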
Sotos

One way with tidyr and dplyr:

library(dplyr)
library(tidyr)

df %>% 
  gather(key, value, -X) %>% 
  filter(value == 1) %>% 
  arrange(X)

Returns:

# A tibble: 6 x 3
  X     key   value
  <chr> <chr> <chr>
1 id1   Y3    1    
2 id2   Y2    1    
3 id3   Y2    1    
4 id4   Y4    1    
5 id5   Y2    1    
6 id6   Y1    1   

Data:

df <- data.frame(stringsAsFactors=FALSE,
                 X = c( "id1", "id2", "id3", "id4", "id5", "id6"),
                 Y1 = c( "0", "0", "0", "0", "0", "1"),
                 Y2 = c("0", "1", "1", "0", "1", "0"),
                 Y3 = c("1", "0", "0", "0", "0", "0"),
                 Y4 = c("0", "0", "0", "1", "0", "0"))
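
For readers on newer versions of tidyr: gather has been superseded by pivot_longer (available from tidyr 1.0.0). The following is a rough equivalent of the pipeline above, assuming the same character data. The column names Y and value are my own choices, and the extra select() step is only there to match the two-column output asked for in the question.

```r
library(dplyr)
library(tidyr)

# Same character data as in this answer
df <- data.frame(stringsAsFactors = FALSE,
                 X  = c("id1", "id2", "id3", "id4", "id5", "id6"),
                 Y1 = c("0", "0", "0", "0", "0", "1"),
                 Y2 = c("0", "1", "1", "0", "1", "0"),
                 Y3 = c("1", "0", "0", "0", "0", "0"),
                 Y4 = c("0", "0", "0", "1", "0", "0"))

result <- df %>%
  # Reshape to long format: one row per (X, column name) pair
  pivot_longer(cols = -X, names_to = "Y", values_to = "value") %>%
  # Keep only the column that holds the 1 for each id
  filter(value == 1) %>%
  select(X, Y) %>%
  arrange(X)
result
```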
tyluRp

Here is a base R solution. I use the following data.frame for illustration:

df <- data.frame(X = c("id1", "id2", "id3", "id4"),
                 Y1 = c(1, 0, 0, 0),
                 Y2 = c(0, 1, 0, 1),
                 Y3 = c(0, 0, 1, 0))
df
#    X Y1 Y2 Y3
#1 id1  1  0  0
#2 id2  0  1  0
#3 id3  0  0  1
#4 id4  0  1  0

In the first step, for every row I will look for the position of the column that contains the value 1.

col_positions <- sapply(X = seq_len(nrow(df)), FUN = function(x) which(df[x, ] == 1))
col_positions
#[1] 2 3 4 3

Now you can use this vector to index colnames(df) and append the result to your data.

df$Y <- colnames(df)[col_positions]
df
#    X Y1 Y2 Y3  Y
#1 id1  1  0  0 Y1
#2 id2  0  1  0 Y2
#3 id3  0  0  1 Y3
#4 id4  0  1  0 Y2

If you only want to keep columns X and Y, you can select them, for example, as follows:

df[, colnames(df) %in% c("X", "Y")]
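
The steps above rely on every row containing exactly one 1. If a row has none or several, sapply returns a list instead of a vector and the indexing step no longer works as intended. Here is a minimal sketch of one way to guard against that, returning NA for such rows. The helper pick_column and the variable indicator_cols are hypothetical names introduced just for this illustration.

```r
df <- data.frame(X  = c("id1", "id2", "id3", "id4"),
                 Y1 = c(1, 0, 0, 0),
                 Y2 = c(0, 1, 0, 1),
                 Y3 = c(0, 0, 1, 0))

# All columns except the id column are treated as indicators
indicator_cols <- setdiff(colnames(df), "X")

# Hypothetical helper: name of the single column equal to 1, else NA
pick_column <- function(row) {
  hits <- indicator_cols[row == 1]
  if (length(hits) == 1) hits else NA_character_
}

# Apply the helper row by row over the indicator columns only
df$Y <- apply(df[indicator_cols], 1, pick_column)
df[, c("X", "Y")]
```

Whether NA is the right answer for ambiguous rows depends on your data; you could just as well stop with an error there.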
markus