I've got a two-column database that lists gene codes and their biological pathways. Within the database, some gene codes are linked to multiple biological pathways:

      A     B
    396139  mesonephros development
    396139  camera-type eye development
    396139  Sertoli cell development

I'm trying to get rid of these repeats, while moving each biological function to a new column:

      A     B                        C                            D
    396139  mesonephros development  camera-type eye development  Sertoli cell development

I've tried a few macros in Excel, but have been unsuccessful in making anything constructive. I'm also a little new to R so I have no idea where I would start to format this. Any help in either software would be much appreciated.

This question differs from the claimed duplicate: that question is about combining columns, whereas I need the values kept in separate columns. The answer in this question is also simpler and does not require an external package and is, therefore, worth keeping separate.

ephackett

1 Answer

We can use data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'Gene.Code', and paste the elements of 'Organ.Developmental.Effect' together. toString is a wrapper for paste(., collapse=', ').

    library(data.table)
    # Convert df1 to a data.table by reference, then collapse each gene's pathways into one string
    setDT(df1)[, list(Col = toString(Organ.Developmental.Effect)), by = Gene.Code]
    #   Gene.Code                                                                           Col
    #1:        11                                        eye photoreceptor cell differentiation
    #2:        19                                        eye photoreceptor cell differentiation
    #3:        37                                        eye photoreceptor cell differentiation
    #4:       674                                           larval salivary gland morphogenesis
    #5:      2033                                                    compound eye morphogenesis
    #6:     2-Sep                                                     imaginal disc development
    #7:     5-Sep                                                     imaginal disc development
    #8:    396139 metanephros development, mesonephros development, camera-type eye development
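
The call above collapses each gene's pathways into a single comma-separated string. If one column per pathway is wanted instead, as in the layout shown in the question, a possible sketch is to number the pathways within each gene and reshape with dcast. This assumes a data.table version that provides rowid(); the sample rows are a trimmed copy of the dput output from the comments, and the names idx and wide are only illustrative.

```r
library(data.table)

# A few rows from the dput output in the comments (plain strings instead of factors)
df1 <- data.frame(
  Gene.Code = c("11", "396139", "396139", "396139"),
  Organ.Developmental.Effect = c(
    "eye photoreceptor cell differentiation",
    "metanephros development",
    "mesonephros development",
    "camera-type eye development"
  ),
  stringsAsFactors = FALSE
)

setDT(df1)

# Number the pathways within each gene: 1, 2, 3, ...
# (idx is an illustrative helper column, not part of the original data)
df1[, idx := rowid(Gene.Code)]

# One row per gene, one column per pathway position; genes with fewer pathways get NA
wide <- dcast(df1, Gene.Code ~ idx, value.var = "Organ.Developmental.Effect")
print(wide)
```

The pathway columns are named after the within-gene counter ("1", "2", "3"), so they can be renamed afterwards with setnames() if more descriptive headers are needed.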
akrun
  • I got an error: Error in `[.data.table`(setDT(DevAmigo), , list(biolpwy = paste(biolpwy, : The items in the 'by' or 'keyby' list are length (1,1). Each must be same length as rows in x or number of rows returned by i (10000). – ephackett Nov 24 '15 at 15:14
  • @ephackett You mentioned two columns in the dataset, right. So, I assumed the `geneid` is the first column – akrun Nov 24 '15 at 15:15
  • I changed it to the data.frame[1] in that section. Should I rename the column and attach the column names? – ephackett Nov 24 '15 at 15:16
  • @ephackett Can you show the `dput` of the first few rows, ie. `dput(droplevels(head(yourdataset, 10)))` – akrun Nov 24 '15 at 15:18
  • Sure thing: structure(list(Gene.Code = structure(c(1L, 2L, 5L, 8L, 4L, 3L, 7L, 6L, 6L, 6L), .Label = c("11", "19", "2-Sep", "2033", "37", "396139", "5-Sep", "674"), class = "factor"), Organ.Developmental.Effect = structure(c(3L, 3L, 3L, 5L, 2L, 4L, 4L, 7L, 6L, 1L), .Label = c("camera-type eye development", "compound eye morphogenesis", "eye photoreceptor cell differentiation", "imaginal disc development", "larval salivary gland morphogenesis", "mesonephros development", "metanephros development"), class = "factor")), .Names = c("Gene.Code", "Organ.Developmental.Effect"), – ephackett Nov 24 '15 at 15:20
  • @ephackett I think it is not the full output. Can you update your post with the dput output. – akrun Nov 24 '15 at 15:22
  • structure(list(Gene.Code = structure(c(1L, 2L, 5L, 8L, 4L, 3L, 7L, 6L, 6L, 6L), .Label = c("11", "19", "2-Sep", "2033", "37", "396139", "5-Sep", "674"), class = "factor"), Organ.Developmental.Effect = structure(c(3L, 3L, 3L, 5L, 2L, 4L, 4L, 7L, 6L, 1L), .Label = c("camera-type eye development", "compound eye morphogenesis", "eye photoreceptor cell differentiation", "imaginal disc development", "larval salivary gland morphogenesis", "mesonephros development", "metanephros development"), class = "factor")), .Names = c("Gene.Code", "Organ.Developmental.Effect") (1/2) – ephackett Nov 24 '15 at 15:24
  • , row.names = c(NA, -10L), .internal.selfref = , class = c("data.table", "data.frame")) (2/2) – ephackett Nov 24 '15 at 15:24
  • @ephackett It is working for me – akrun Nov 24 '15 at 15:29
  • It did! Thanks so much for your patience with an R newbie! – ephackett Nov 24 '15 at 15:35