How do I combine row values from one column, separated by ",", grouped by the value of a different column?

Question

I am working with proteomic data and testing differences between versions of the analysis software. We want a table that shows which versions of the software each protein appears in.

Below is a simplified version of the data table I currently have:

Version Protein.ID Protein name
1.1     A          name 1
1.2     A          name 1
1.1     B          name 2
1.2     B          name 2

I want my table to look like this:

Version   Protein.ID Protein name
1.1, 1.2  A          name 1
1.1, 1.2  B          name 2

I have been searching here and on the web for 2 days and cannot find a solution.

I have tried using spread and aggregate, but neither worked. I either got a huge number of columns or a single column lacking the information I was after. I also tried base R commands like paste, but could not get rid of duplicate values.

Example of something I tried:

allver.mergeVerID <- spread(allver.ids, Protein.ID, Ver.ID.Porder)

Error: Each row of output must be identified by a unique combination of keys. 
Keys are shared for 5311 rows:

I also get this error using

allver.mergeVerID <- allver.ids %>%
  group_by(Protein.ID) %>%
  summarise(Ver.ID.Porder = toString(Ver.ID.Porder))

OR

allver.mergeVerID <- aggregate(Ver.ID.Porder ~ Protein.ID, allver.ids, toString)

What does this error mean?

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

Here is one way. After grouping by 'Protein.ID' and 'Protein name', summarise 'Version' by pasting its elements together with toString()

library(dplyr)
df1 %>%
  group_by(Protein.ID, `Protein name`) %>%
  summarise(Version = toString(Version))

Or with aggregate from base R

aggregate(Version ~ Protein.ID + `Protein name`, df1, toString)
#  Protein.ID Protein name  Version
#1          A       name 1 1.1, 1.2
#2          B       name 2 1.1, 1.2

NOTE: Both solutions match the expected output
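
If the real data contain repeated Version values within a protein (the question mentions trouble getting rid of duplicates), one option is to wrap the column in unique() before collapsing it. This is only a sketch: df2 is a made-up variant of the example data with one duplicated row, and sort() is added just to make the order predictable.

```r
library(dplyr)

# Hypothetical data: the A / 1.1 row appears twice on purpose
df2 <- data.frame(Version = c(1.1, 1.1, 1.2, 1.1, 1.2),
                  Protein.ID = c('A', 'A', 'A', 'B', 'B'),
                  `Protein name` = c('name 1', 'name 1', 'name 1',
                                     'name 2', 'name 2'),
                  check.names = FALSE, stringsAsFactors = FALSE)

res <- df2 %>%
  group_by(Protein.ID, `Protein name`) %>%
  # unique() drops repeated versions before toString() pastes them together
  summarise(Version = toString(sort(unique(Version)))) %>%
  ungroup()

res
```

Any other column that is constant within a protein can be kept by adding it to group_by(), the same way `Protein name` is kept here.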

data

df1 <- data.frame(Version = c(1.1, 1.2, 1.1, 1.2),
                  Protein.ID = c('A', 'A', 'B', 'B'),
                  `Protein name` = c('name 1', 'name 1', 'name 2', 'name 2'),
                  check.names = FALSE, stringsAsFactors = FALSE)

edited by Community · answered Oct 18 '19 at 20:56 by akrun
These both kind of work, but I still get the error listed above... Where would I add the code to keep all other columns in the table? – Ryandcalvert Oct 18 '19 at 21:03
@Ryandcalvert. You showed only two columns and my solution is based on that example – akrun Oct 18 '19 at 21:04
I updated the question with the additional column I would like to keep – Ryandcalvert Oct 18 '19 at 21:13
@Ryandcalvert. Updated the code as well – akrun Oct 18 '19 at 21:13
thank you, I'll try this... any suggestion about the error I'm getting? – Ryandcalvert Oct 18 '19 at 21:18
I like @Ryandcalvert solution. I'd just add as.data.frame() to the dplyr solution to make it a data frame. – Diego Rodrigues Oct 18 '19 at 21:18
You get the error because you're using only one of the variables as key. You need to use both (ID and name). – Diego Rodrigues Oct 18 '19 at 21:19
