consolidate duplicate rows and add column in R

Question

I'd like to know how to consolidate duplicate rows in a data frame and then combine the duplicated values in another column.

Here's a sample of the existing dataframe and two dataframes that would be acceptable as a solution

df1 <- data.frame(col1 = c("test1", "test2", "test2", "test3"), col2 = c(1, 2, 3, 4))
df.ideal <- data.frame(col1 = c("test1", "test2", "test3"), col2 = c(1, "2, 3", 4))
df.ideal2 <- data.frame(col1 = c("test1", "test2", "test3"), 
                        col2 = c(1, 2, 4), 
                        col3 = c(NA, 3, NA))

In the first ideal dataframe, the duplicated row is collapsed and the column is added with both numbers. I've looked at other similar questions on stack overflow, but they all dealt with combining rows. I need to delete the duplicate row because I have another dataset I'm merging it with that needs the a certain number of rows. So, I want to preserve all of the values. Thanks for your help!

score 6 · Accepted Answer · answered Dec 07 '14 at 06:38

To go from df1 to df.ideal, you can use aggregate().

aggregate(col2~col1, df1, paste, collapse=",")
#    col1 col2
# 1 test1    1
# 2 test2  2,3
# 3 test3    4

If you want to get to df.ideal2, that's more of a reshaping from long to wide process. You can do

reshape(transform(df1, time=ave(col2, col1, FUN=seq_along)), idvar="col1", direction="wide")
#    col1 col2.1 col2.2
# 1 test1      1     NA
# 2 test2      2      3
# 4 test3      4     NA

using just the base reshape() function.

Probably `aggregate(col2 ~ col1, df1, toString)` would be cleaner — David Arenburg, Dec 07 '14 at 09:52

akrun · Answer 2 · 2014-12-07T07:06:51.430

Another option would be to use splitstackshape

library(data.table)
library(splitstackshape)
DT1 <- setDT(df1)[,list(col2=toString(col2)) ,col1]
DT1
#    col1 col2
#1: test1    1
#2: test2 2, 3
#3: test3    4

You could split the col2 in DT1 to get the df.ideal2 or

cSplit(DT1, 'col2', sep=',')
#   col1 col2_1 col2_2
#1: test1      1     NA
#2: test2      2      3
#3: test3      4     NA

or from df1

 dcast.data.table(getanID(df1, 'col1'), col1~.id, value.var='col2')
 #   col1 1  2
 #1: test1 1 NA
 #2: test2 2  3
 #3: test3 4 NA

consolidate duplicate rows and add column in R

2 Answers2