Compare different rows and remove repeated values

Question

    Category                 Genes
"Tissue morphology"       "AKT, TGF1B, IFNG, IgG, Igm"
"Tissue morphology"       "ELOVL3, EREG, FABP5, FOXP3, glycerol, GSTO1, HDAC1"
"Tissue morphology"       "AKT, FABPS, Igm"
"Cell growth"             "AICDA, BID, CD200R1, CD36, CSF2"
"Cell growth"             "5-hydroxytryptamine, adenosine triphosphate, AICDA"

I have several tables, each consisting of several rows with identical values in the "Category" column but partly different, partly identical values in the "Genes" column. I would like to merge all rows with the same "Category" into a single row, keeping each gene once and removing the repeats. Is there a good way to do this? I have tried "intersect" and "merge", but the result was not clean or easy. I've looked for an answer but haven't found anything yet, so I'd be very thankful for any help! The desired output is:

    Category                 Genes
"Tissue morphology"         "AKT, TGF1B, IFNG, IgG, Igm, ELOVL3, EREG, FABPS, FABP5, FOXP3, glycerol, GSTO1, HDAC1"
"Cell growth"               "5-hydroxytryptamine, adenosine triphosphate, AICDA, BID, CD200R1, CD36, CSF2"

akrun · Accepted Answer · 2015-02-23T15:20:08.367

For this, we could use data.table (other options include aggregate, dplyr, etc.). Convert the "data.frame" to a "data.table" (setDT(df1)), group by "Category", split the "Genes" column (strsplit), then unlist, keep the unique values, sort, and paste them together (toString is a wrapper for paste(., collapse=", ")).
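The individual steps can be traced on a single category before applying them per group. This is only a minimal base R sketch: the two strings are copied from the question's "Tissue morphology" rows, and the variable names (genes, pieces, flat, uniq, merged) are made up for illustration.

```r
# Two "Genes" strings from the same category, copied from the question
genes <- c("AKT, TGF1B, IFNG, IgG, Igm",
           "AKT, FABPS, Igm")

pieces <- strsplit(genes, ", ")   # list: one character vector per row
flat   <- unlist(pieces)          # one character vector with all genes
uniq   <- unique(flat)            # repeats dropped, first occurrence kept
# uniq is: "AKT" "TGF1B" "IFNG" "IgG" "Igm" "FABPS"

# sort() orders according to the current locale, so the exact order
# of mixed-case names can differ between machines
merged <- toString(sort(uniq))    # joined back with ", "
```
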

library(data.table)
DT1 <- setDT(df1)[, list(Genes=toString(sort(unique(unlist(strsplit(Genes, 
                ', ')))))), by=Category]

DT1$Genes
#[1] "AKT, ELOVL3, EREG, FABP5, FABPS, FOXP3, glycerol, GSTO1, HDAC1, IFNG, IgG, Igm, TGF1B"
#[2] "5-hydroxytryptamine, adenosine triphosphate, AICDA, BID, CD200R1, CD36, CSF2"
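Since aggregate is mentioned as another option, here is a rough base R sketch of the same idea. It assumes df1 has character (not factor) columns named Category and Genes; df1 is rebuilt here from the question's table, and merge_genes is a hypothetical helper name.

```r
# Example data rebuilt from the question's table
df1 <- data.frame(
  Category = c("Tissue morphology", "Tissue morphology", "Tissue morphology",
               "Cell growth", "Cell growth"),
  Genes = c("AKT, TGF1B, IFNG, IgG, Igm",
            "ELOVL3, EREG, FABP5, FOXP3, glycerol, GSTO1, HDAC1",
            "AKT, FABPS, Igm",
            "AICDA, BID, CD200R1, CD36, CSF2",
            "5-hydroxytryptamine, adenosine triphosphate, AICDA"),
  stringsAsFactors = FALSE   # strsplit() needs character input
)

# Hypothetical helper: split, flatten, de-duplicate, sort, re-join
merge_genes <- function(x) {
  toString(sort(unique(unlist(strsplit(x, ", ")))))
}

# One row per Category; the row order of the result may differ from df1
res <- aggregate(df1["Genes"], by = df1["Category"], FUN = merge_genes)
```

If the Genes column is a factor, wrapping it in as.character() before splitting should be enough.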

Or the duplicate elements can be removed without splitting the string, but the result will not be sorted alphabetically:

DT2 <- setDT(df1)[, list(Genes=gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*), ', '',
           toString(Genes), perl=TRUE)), Category]

DT2$Genes
#[1] "TGF1B, IFNG, IgG, ELOVL3, EREG, FABP5, FOXP3, glycerol, GSTO1, HDAC1, AKT, FABPS, Igm"
#[2] "BID, CD200R1, CD36, CSF2, 5-hydroxytryptamine, adenosine triphosphate, AICDA"
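The regex above works on whitespace-free tokens and keeps the last occurrence of each repeat, so a repeated multi-word entry such as "adenosine triphosphate" may not be treated as one unit. If unsorted output is wanted, a possible alternative is to split as before and simply skip the sort; unique() then keeps the first occurrence of each entry. This is a sketch in base R, using the question's two "Cell growth" rows, with dedupe_keep_order as a hypothetical helper name.

```r
# The two "Cell growth" rows from the question
df1 <- data.frame(
  Category = c("Cell growth", "Cell growth"),
  Genes = c("AICDA, BID, CD200R1, CD36, CSF2",
            "5-hydroxytryptamine, adenosine triphosphate, AICDA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: de-duplicate while keeping first-seen order
dedupe_keep_order <- function(x) {
  toString(unique(unlist(strsplit(x, ", ", fixed = TRUE))))
}

res <- aggregate(df1["Genes"], by = df1["Category"], FUN = dedupe_keep_order)
res$Genes
# [1] "AICDA, BID, CD200R1, CD36, CSF2, 5-hydroxytryptamine, adenosine triphosphate"
```
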
