0

I need to add a column containing gene names to a dataframe with information about isoform abundance. I have two tables.

DF 1 (contains ensemble gene ID #s as the rownames and asorted isoform abundance values in about 15 additional columns)

                              event_name              sample1_posterior_mean                        
gene:ENSMUSG00000079523       0.93,0.02,0.00,0.06     0.90,0.01,0.00,0.04
gene:ENSMUSG00000078572       0.78                    0.67
gene:ENSMUSG00000022548       0.63                    0.25

DF 2 (contains 3 columns ensemble gene ID #s and gene names)

          Ensemble_Transcript_ID  Ensemble_Gene_ID     External_Gene_ID
2335      ENSMUST00000101973      ENSMUSG00000096659            Gm25679
2336      ENSMUST00000179019      ENSMUSG00000095915           n-R5s115
2337      ENSMUST00000183908      ENSMUSG00000099299            Gm27722
2338      ENSMUST00000044752      ENSMUSG00000039481               Nrtn
2339      ENSMUST00000179157      ENSMUSG00000095476            Gm25077

I would like to add the External_Gene_ID column from DF 2 to the appropriate Ensemble_Gene_ID column in DF 1. I know there is a way to merge these two data frames together based on a column of interest

I hope i have explained this in sufficient detail. Thank you for your help!

Paul
  • 656
  • 1
  • 8
  • 23
  • 1
    "I know there is a way to merge these two data frames together based on a column of interest" - this begs the question - why haven't you tried it then? – SymbolixAU Nov 06 '16 at 21:36
  • Try this example, we can merge by rownames, too. `df1 <- mtcars[, 1:2]; df2 <- mtcars[, 3:4]; df2$myCol <- rownames(mtcars); merge(df1, df2, by.x = "row.names", by.y = "myCol")`, see [here](http://stackoverflow.com/questions/7739578/merge-data-frames-based-on-rownames-in-r) for more info. – zx8754 Nov 06 '16 at 22:21

1 Answers1

0

A dplyr solution (maybe) is to first create the Ensemble_Gene_ID column in DF1, and then use dplyr to join based on that column. The left join is such that items from DF2 that match DF1 will become part of the new DF, but those gene ID's unique to DF2 will not be (i.e., DF1 is sort of the template that you care about, such that the output will maintain the original nrow of DF1).

DF1$Ensemble_Gene_ID <- sapply(rownames(DF1), 
                          function(x){unlist(strsplit(x,':'))[2]})})

DF3 <- dplyr::left_join(DF1, DF2, by=c('Ensemble_Gene_ID'='External_Gene_ID'))

but I'm unsure if this works without an example chunk of each data.frame.

Tad Dallas
  • 1,179
  • 5
  • 13
  • I think something close to this is likely to work. But the situation is complicated that the gene IDs in DF1 exist in column 0 (the rownames column) which doesn't have a header text label. – Paul Nov 06 '16 at 21:52
  • Yes, I saw that after I answered. I edited my answer to create the column named `Ensemble_Gene_ID` in DF1 (the `sapply` statement). Hopefully this is on the correct path. Perhaps add a MWE containing some small subset of `DF1` and `DF2`? – Tad Dallas Nov 06 '16 at 21:54
  • The first part of your answer worked well to add the Ensemble_Gene_ID column to DF1. However i ran into problems with the second command which I can't seem to get it to work. I consistently get the following error using any of the dplyr *_join commands. 'Error: 'Ensemble_Gene_ID' column not found in lhs, cannot join' I assume lhs means Left hand side..? – Paul Nov 06 '16 at 22:25
  • Have you run that first line that create the Ensemble_Gene_ID column based on rownames? – Tad Dallas Nov 06 '16 at 22:43
  • Yes i have run that first line and it does successfully create the Ensemble_Gene_ID column in DF1 as intended. I verified this by looking at the modified DF1. – Paul Nov 06 '16 at 23:04
  • Typo in the code. Sorry about that. Fixed now. `by` argument error. – Tad Dallas Nov 06 '16 at 23:10
  • 1
    Thank you tad! there was one more change that had to be made which i will record for anyone else finding this post. Ultimately the final 2nd command looks like this. `DF3 <- dplyr::left_join(DF1, DF2, by=c('Ensemble_Gene_ID'='Ensemble_Gene_ID'))` This is because we are joining the spreadsheets by matching the Ensemble_Gene_ID columns in both DF1 and DF2. Using Ensemble_Gene_ID and External_Gene_ID as the basis to match results in empty columns. – Paul Nov 07 '16 at 14:44