1

I have a data frame like

> df[1:10,]
   SID Group Source
1    1    10    C04
2    1     5    M04
3    1     5    M02
4    1     5    M03
5    2    40    M04
6    2    40    M02
7    2    40    M03
8    3    45    M01
9    3    40    M01
10   3    40    C01

Now I want to merge the Source-Column based on SID and Group. So that I get the following Output

  SID  Group  Source
1 1    10     {C04}
2 1    5      {M04, M02, M03}
3 2    40     {M04, M02, M03}
4 3    45     {M01}
5 3    40     {M01, C01}
...

I already tried to use aggregate

> y <- as.data.frame(aggregate(Source~., data=df, paste,collapse=",", na.rm=TRUE))

and order the results based on

> y <- y[order(y$SID,-y$Group),]

which delivers almost the expected result

   SID  Group   Source
 1 1    10      C04 TRUE
 2 1    5       M04 TRUE,M02 TRUE,M03 TRUE
 3 2    40      M04 TRUE,M02 TRUE,M03 TRUE
 4 3    45      M01 TRUE

However I want to get rid of the TRUEs in my last column and I would like to know why do I get these and how can I avoid them?

Does someone have a clue or maybe an explanation? I appreciate your help.

Meldamos
  • 43
  • 4

1 Answers1

1

We can try

aggregate(Source~., df, FUN=toString)
#   SID Group        Source
#1   1     5 M04, M02, M03
#2   1    10           C04
#3   2    40 M04, M02, M03
#4   3    40      M01, C01
#5   3    45           M01

should get the expected output

toString is wrapper for paste(., collapse=', '). There is no na.rm=TRUE in the paste. So adding na.rm=TRUE will add a new element TRUE. The na.rm=TRUE is normally used with mean, median etc, which have that option. For example, if I use that with paste

aggregate(Source~., df, FUN=paste, collapse=', ', na.rm=TRUE)
  #  SID Group                       Source
  #1   1     5 M04 TRUE, M02 TRUE, M03 TRUE
  #2   1    10                     C04 TRUE
  #3   2    40 M04 TRUE, M02 TRUE, M03 TRUE
  #4   3    40           M01 TRUE, C01 TRUE
  #5   3    45                     M01 TRUE

Instead, we can do

aggregate(Source~., df, FUN=function(x) 
                       paste(x[!is.na(x)], collapse=', '))
 #  SID Group        Source
 #1   1     5 M04, M02, M03
 #2   1    10           C04
 #3   2    40 M04, M02, M03
 #4   3    40      M01, C01
 #5   3    45           M01

Or with data.table

library(data.table)
setDT(df)[,list(Source=toString(Source)) , .(SID, Group)]
#   SID Group        Source
#1:   1    10           C04
#2:   1     5 M04, M02, M03
#3:   2    40 M04, M02, M03
#4:   3    45           M01
#5:   3    40      M01, C01
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Ahhh, I didn't know that it is adding the TRUE due to the "na.rm"-Argument. A little strange, since the argument should remove NAs. However, since I got no NAs in my data I removed the argument and now I got it how I wanted. Thank you very much! – Meldamos Jan 22 '16 at 12:15
  • @Meldamos Thanks for the feedback. Strangely, `na.rm=TRUE` with `toString` is not adding the `TRUE` values. – akrun Jan 22 '16 at 12:16