R- merge two dataframes but values of ID have semicolons

Question

This is a followup question to: R- merge two data frames but some values have semi colon in them which has been addressed by contributor: agstudy.

The actual data discussed in the link is a bit more complex and i have been stuck for a while.

This is what my dataframe (df2) looks like:

myIDColumn  someName    somevalue       
AB  gsdfg   123     
CD  tfgsdfg 234     
EF  sfdgsf  365     
GH  gdfgb   53453       
IJ  sr  64564       
KL  sfsdv   4234234     
MN  ewrwe   5       
OP  dsfsss  3453        
QR  gggg    667     
ST  dss 7567        
UV  hhhhjf  55      
WX  dfadasad    8657        
YZ  ghfgh   1234        
ABC gdgfg   234455      
VCB hgjkk   5555667

This is what my df1 looks like:

ID  someText    someThing       
AB  ada 12      
CD;EF;QR    dfsdf   13      
IJ  fgfgd   14      
KL  fgdg    15      
MN  gh  16      
OP;WX   jhjhj   17      
WW  ghjgjhgjghj 18      
YZ  kkl 19

This is what i was hoping to get as an output:

enter image description here

I can merge the two well by using:

mm <- merge(df2,df1,by.y='ID',by.x='myIDColumn',all.y=TRUE)

but after that no idea how to proceed further.

Any help is really appreciated. Thanks.

df1:

structure(list(ID = structure(1:8, .Label = c("AB", "CD;EF;QR", 
"IJ", "KL", "MN", "OP;WX", "WW", "YZ"), class = "factor"), someText = structure(c(1L, 
2L, 4L, 3L, 5L, 7L, 6L, 8L), .Label = c("ada", "dfsdf", "fgdg", 
"fgfgd", "gh", "ghjgjhgjghj", "jhjhj", "kkl"), class = "factor"), 
    someThing = 12:19), .Names = c("ID", "someText", "someThing"
), class = "data.frame", row.names = c(NA, -8L))

df2:

structure(list(myIDColumn = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 14L, 15L, 2L, 13L), .Label = c("AB", "ABC", 
"CD", "EF", "GH", "IJ", "KL", "MN", "OP", "QR", "ST", "UV", "VCB", 
"WX", "YZ"), class = "factor"), someName = structure(c(9L, 15L, 
12L, 5L, 14L, 13L, 4L, 2L, 7L, 3L, 11L, 1L, 8L, 6L, 10L), .Label = c("dfadasad", 
"dsfsss", "dss", "ewrwe", "gdfgb", "gdgfg", "gggg", "ghfgh", 
"gsdfg", "hgjkk", "hhhhjf", "sfdgsf", "sfsdv", "sr", "tfgsdfg"
), class = "factor"), somevalue = c(123L, 234L, 365L, 53453L, 
64564L, 4234234L, 5L, 3453L, 667L, 7567L, 55L, 8657L, 1234L, 
234455L, 5555667L)), .Names = c("myIDColumn", "someName", "somevalue"
), class = "data.frame", row.names = c(NA, -15L))

user1609452 · Accepted Answer · 2013-05-28T02:23:30.233

2

There are probably better ways to do it but you could create a temporary dataframe:

df1 <- structure(list(ID = c("AB", "CD;EF;QR", "IJ", "KL", "MN", "OP;WX", 
"WW", "YZ"), someText = c("ada", "dfsdf", "fgfgd", "fgdg", "gh", 
"jhjhj", "ghjgjhgjghj", "kkl"), someThing = 12:19), .Names = c("ID", 
"someText", "someThing"), class = "data.frame", row.names = c(NA, 
-8L))


df2 <- structure(list(myIDColumn = c("AB", "CD", "EF", "GH", "IJ", "KL", 
"MN", "OP", "QR", "ST", "UV", "WX", "YZ", "ABC", "VCB"), someName = c("gsdfg", 
"tfgsdfg", "sfdgsf", "gdfgb", "sr", "sfsdv", "ewrwe", "dsfsss", 
"gggg", "dss", "hhhhjf", "dfadasad", "ghfgh", "gdgfg", "hgjkk"
), somevalue = c(123L, 234L, 365L, 53453L, 64564L, 4234234L, 
5L, 3453L, 667L, 7567L, 55L, 8657L, 1234L, 234455L, 5555667L)), .Names = c("myIDColumn", 
"someName", "somevalue"), class = "data.frame", row.names = c(NA, 
-15L))
f <- function(x) {
    y <-  unlist(strsplit(x$ID,';'))
    data.frame(ID = x$ID, someText = x$someText, someThing = x$someThing, ID1 = y) 
}
library(plyr)
df3 <- ddply(df1, .(ID), f)

> df3
         ID    someText someThing ID1
1        AB         ada        12  AB
2  CD;EF;QR       dfsdf        13  CD
3  CD;EF;QR       dfsdf        13  EF
4  CD;EF;QR       dfsdf        13  QR
5        IJ       fgfgd        14  IJ
6        KL        fgdg        15  KL
7        MN          gh        16  MN
8     OP;WX       jhjhj        17  OP
9     OP;WX       jhjhj        17  WX
10       WW ghjgjhgjghj        18  WW
11       YZ         kkl        19  YZ

You could merge this with your dataframe df2 and summarize the data:

mm <- merge(df2,df3,by.y='ID1',by.x='myIDColumn',all.y=TRUE)
ddply(mm, .(ID,someText, someThing), summarize,  
           somevalue = paste(somevalue, collapse=','),
                 someName = paste(someName, collapse = ","))

        ID    someText someThing   somevalue            someName
1       AB         ada        12         123               gsdfg
2 CD;EF;QR       dfsdf        13 234,365,667 tfgsdfg,sfdgsf,gggg
3       IJ       fgfgd        14       64564                  sr
4       KL        fgdg        15     4234234               sfsdv
5       MN          gh        16           5               ewrwe
6    OP;WX       jhjhj        17   3453,8657     dsfsss,dfadasad
7       WW ghjgjhgjghj        18          NA                  NA
8       YZ         kkl        19        1234               ghfgh

edited May 28 '13 at 02:23

answered May 28 '13 at 01:03

user1609452

4,406
1
15
20

:Thanks for your code! Somehow when i run your code on my actual dataset , i get the following columns in my output file (in order): myIDCoumn someName someValue ID someText someThing – RnD May 28 '13 at 02:14
and the someName and someValue column values donot collapse with a comma – RnD May 28 '13 at 02:15
I have `dput` my values for `df1` and `df2`. Yours probably have `stringsAsFactors = TRUE`. – user1609452 May 28 '13 at 02:22
i am not familiar with dput...can you please elaborate as to how i can fix my issue. Thank you so much for helping me out. – RnD May 28 '13 at 02:24
`dput(x)` will give you a version of your object `x` that other people can source in. I dont know what your `df1` and `df2` contain so it is harder to help. – user1609452 May 28 '13 at 02:26
"999", "9990", "9991", "9992", "9993", "9994", "9997"), class = "factor")), .Names = c("UniProtID", "ApprovedSymbol", "EntrezID"), class = "data.frame", row.names = c(NA, -18902L))"999", "9990", "9991", "9992", "9993", "9994", "9997"), class = "factor")), .Names = c("UniProtID", "ApprovedSymbol", "EntrezID"), class = "data.frame", row.names = c(NA, -18902L)) – RnD May 28 '13 at 02:35
above are the last lines of df2 – RnD May 28 '13 at 02:37
"Q9Y6G9", "Q9Y6I3", "Q9Y6M1", "Q9Y6M7", "Q9Y6X9"), class = "factor")), .Names = c("Protein.Group.Accessions", "Modifications", "PSM", "ID1"), row.names = c(NA, -640L), class = "data.frame")......last line of df3 – RnD May 28 '13 at 02:37
Your dataframes contain factors. Read http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters to help convert them – user1609452 May 28 '13 at 02:39
i don't think the issue is that because once i read both frames i do the following: gene_prot_mappings$UniProtID <- as.character(gene_prot_mappings$UniProtID) gene_prot_mappings$ApprovedSymbol <- as.character(gene_prot_mappings$ApprovedSymbol) gene_prot_mappings$EntrezID <- as.character(gene_prot_mappings$EntrezID) a3$Protein.Group.Accessions <- as.character(a3$Protein.Group.Accessions) a3$Modifications <- as.character(a3$Modifications) a3$X..PSMs <- as.character(a3$X..PSMs) – RnD May 28 '13 at 02:53
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/30719/discussion-between-user1609452-and-maziz) – user1609452 May 28 '13 at 02:55
so it works now...i had to assign mm<-ddply(mm, .(ID,someText, someThing), summarize, somevalue = paste(somevalue, collapse=','), someName = paste(someName, collapse = ",")) in the end when i write to a file. Thanks so much for the code.... – RnD May 28 '13 at 04:05

R- merge two dataframes but values of ID have semicolons

1 Answers1