I have gene data in which some genes appear in duplicate rows. For those genes I'd like to compress the rows so that no information is lost and all of the duplicate information is combined into one row per gene. I've seen similar questions (like "How to combine duplicate rows in a data frame in R"), but those select the largest value among the duplicates; I haven't found one that keeps all of the duplicate information in a single row.

For example I have data like this:

gene   pvalue   info
ACE     0.7     benign
ACE     0.001   pathogenic
ACE     0.5     benign
BRCA    0.01    benign
NOS     0.2     benign
NOS     0.003   pathogenic
NOS     0.57    benign

I want the duplicates to combine/compress into

gene   pvalue                info
ACE    0.7, 0.001, 0.5      benign, pathogenic, benign
BRCA   0.01                 benign
NOS    0.2, 0.003, 0.57     benign, pathogenic, benign

The aim is that, after compression, I will write code to select either the largest or the smallest number within each numeric cell for that gene.

So far I've tried aggregate() to compress the duplicate gene information, but it requires a FUN argument, which I don't want to set and don't know how to get around.

DN1

1 Answer


Here's a way using data.table:

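library(data.table)

# Convert df to a data.table by reference and make pvalue character,
# then paste the values of each column together within every gene
setDT(df)[, pvalue := as.character(pvalue)
          ][, pvalue := paste0(pvalue, collapse = ", "), by = gene
          ][, info := paste0(info, collapse = ", "), by = gene]

# Every row of a gene now holds the same strings, so drop the repeats
unique(df)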

#   gene           pvalue                       info
#1:  ACE  0.7, 0.001, 0.5 benign, pathogenic, benign
#2: BRCA             0.01                     benign
#3:  NOS 0.2, 0.003, 0.57 benign, pathogenic, benign
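
Since the question mentions aggregate(), here is a minimal base R sketch of the same idea. It is not part of the data.table approach above: it assumes that a FUN which keeps every value, such as toString(), is acceptable. The data frame name `genes` and the result name `collapsed` are made up for illustration.

```r
# Hypothetical copy of the question's example data (the name `genes` is an assumption)
genes <- data.frame(
  gene   = c("ACE", "ACE", "ACE", "BRCA", "NOS", "NOS", "NOS"),
  pvalue = c(0.7, 0.001, 0.5, 0.01, 0.2, 0.003, 0.57),
  info   = c("benign", "pathogenic", "benign", "benign",
             "benign", "pathogenic", "benign")
)

# toString() joins the values of each group with ", ", so nothing is dropped
collapsed <- aggregate(genes[c("pvalue", "info")],
                       by = list(gene = genes$gene),
                       FUN = toString)

print(collapsed)
```

FUN does not have to summarise: any function that returns one value per group works, and a pasted string is one such value.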

data

df <- structure(list(gene = structure(c(1L, 1L, 1L, 2L, 3L, 3L, 3L), .Label = c("ACE","BRCA", "NOS"), class = "factor"), pvalue = c(0.7, 0.001, 0.5, 0.01, 0.2, 0.003, 0.57), info = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 1L), .Label = c("benign", "pathogenic"), class = "factor")), class = "data.frame", row.names = c(NA,-7L))
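
If the end goal is the smallest or largest p-value per gene, as the question says, it may be simpler to compute that on the numeric column before collapsing. The sketch below shows that, and also one way to recover the minimum from an already collapsed string. The names `genes`, `p_range`, `collapsed`, `min_p` and `max_p` are assumptions made for this example.

```r
library(data.table)

# Hypothetical copy of the example data (the name `genes` is an assumption)
genes <- data.table(
  gene   = c("ACE", "ACE", "ACE", "BRCA", "NOS", "NOS", "NOS"),
  pvalue = c(0.7, 0.001, 0.5, 0.01, 0.2, 0.003, 0.57),
  info   = c("benign", "pathogenic", "benign", "benign",
             "benign", "pathogenic", "benign")
)

# Smallest and largest p-value per gene, computed on the numeric column
p_range <- genes[, .(min_p = min(pvalue), max_p = max(pvalue)), by = gene]

# If only the collapsed strings are available, split them and convert back
collapsed <- genes[, .(pvalue = paste(pvalue, collapse = ", ")), by = gene]
collapsed[, min_p := sapply(strsplit(pvalue, ", ", fixed = TRUE),
                            function(v) min(as.numeric(v)))]

print(p_range)
print(collapsed)
```

Working on the numeric column avoids the round trip through text, which would lose precision for values that as.character() cannot print exactly.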
sm925