0

I have a large dataset with one column with genes names, four columns with the detection methods (X1-X4) and three columns with type of mutation (Y5-Y7). I would like to merge the rows by the name of the gene and that the gene contain 1 whenever there is a 1 in one of the columns. Example of the table:

GENE   X1  X2  X3  X4  Y5  Y6  Y7
AKT1   1   0   0   0   0   1   0 
AKT1   0   0   1   0   0   0   1 
AKT1   0   0   1   0   0   1   0
CENPF  0   1   0   0   0   1   0
CENPF  0   0   1   0   0   1   0
FOXA1  1   0   0   0   0   1   0
FOXA1  0   1   0   0   0   1   0
KMT2C  0   1   0   0   1   0   0
KMT2C  0   0   1   0   1   0   0

Example of the table results using the information of the above table.

GENE   X1  X2  X3  X4  Y5  Y6  Y7
AKT1   1   0   1   0   0   1   1 
CENPF  0   1   1   0   0   1   0
FOXA1  1   1   0   0   0   1   0
KMT2C  0   1   1   0   1   0   0

Thanks for your help

  • Can you share what you already tried? Also, take a look at this [link](http://stackoverflow.com/questions/5963269) on how to ask questions. – Sven Dec 01 '20 at 13:11

3 Answers3

2

You can use rowsum to merge by GENE. rowsum sums up the values and with > 0 you get FALSE / TRUE in case it is larger than 0 and with + you get back values 0 or 1.

+(rowsum(x[-1], x$GENE) > 0)
#      X1 X2 X3 X4 Y5 Y6 Y7
#AKT1   1  0  1  0  0  1  1
#CENPF  0  1  1  0  0  1  0
#FOXA1  1  1  0  0  0  1  0
#KMT2C  0  1  1  0  1  0  0

Data:

x <- read.table(header=TRUE, text="
GENE   X1  X2  X3  X4  Y5  Y6  Y7
AKT1   1   0   0   0   0   1   0 
AKT1   0   0   1   0   0   0   1 
AKT1   0   0   1   0   0   1   0
CENPF  0   1   0   0   0   1   0
CENPF  0   0   1   0   0   1   0
FOXA1  1   0   0   0   0   1   0
FOXA1  0   1   0   0   0   1   0
KMT2C  0   1   0   0   1   0   0
KMT2C  0   0   1   0   1   0   0")
GKi
  • 37,245
  • 2
  • 26
  • 48
1

Does this work:

library(dplyr)
dat %>% group_by(GENE) %>% summarise(across(X1:Y7, ~ case_when(1 %in% . ~ 1, TRUE ~ 0)))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 8
  GENE     X1    X2    X3    X4    Y5    Y6    Y7
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AKT1      1     0     1     0     0     1     1
2 CENPF     0     1     1     0     0     1     0
3 FOXA1     1     1     0     0     0     1     0
4 KMT2C     0     1     1     0     1     0     0

Data used:

dat
# A tibble: 9 x 8
  GENE     X1    X2    X3    X4    Y5    Y6    Y7
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AKT1      1     0     0     0     0     1     0
2 AKT1      0     0     1     0     0     0     1
3 AKT1      0     0     1     0     0     1     0
4 CENPF     0     1     0     0     0     1     0
5 CENPF     0     0     1     0     0     1     0
6 FOXA1     1     0     0     0     0     1     0
7 FOXA1     0     1     0     0     0     1     0
8 KMT2C     0     1     0     0     1     0     0
9 KMT2C     0     0     1     0     1     0     0
Karthik S
  • 11,348
  • 2
  • 11
  • 25
1

One way would be to take max for all the columns for each GENE.

This can be done in base R :

result <- aggregate(.~GENE, df, max, na.rm = TRUE)
result

#   GENE X1 X2 X3 X4 Y5 Y6 Y7
#1  AKT1  1  0  1  0  0  1  1
#2 CENPF  0  1  1  0  0  1  0
#3 FOXA1  1  1  0  0  0  1  0
#4 KMT2C  0  1  1  0  1  0  0

dplyr :

library(dplyr)
df %>% group_by(GENE) %>% summarise(across(X1:Y7, max, na.rm = TRUE))

and data.table :

library(data.table)
setDT(df)[, lapply(.SD, max), GENE, .SDcols = X1:Y7]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213