12

I have a data frame df with rows that are duplicates for the names column but not for the values column:

name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y

I need to aggregate the duplicate names into one row, while calculating the mean over the values column. The expected output is as follows:

name    value   etc1    etc2
A       10      1       X
B       2       1       Y
C       45      1       Y

I have tried to use df[duplicated(df$name),] but of course this does not give me the mean over the duplicates. I would like to use aggregate(), but the problem is that the FUN part of this function will apply to all the other columns as well, and among other problems, it will not be able to compute char content. Since all the other columns have the same content over the "duplicates", I need them to be aggregated as is just like the name column. Any hints...?

agstudy
  • 119,832
  • 17
  • 199
  • 261
biohazard
  • 2,017
  • 10
  • 28
  • 41
  • 1
    Are your other columns `etcX` also guaranteed to be the same for rows with the same `name`? – Hong Ooi Jun 29 '13 at 18:57
  • @HongOoi Yes I believe so, I filtered them in the previous step to get similar contents. – biohazard Jun 29 '13 at 19:38
  • @agstudy Sorry, the expected output was already included in my post, but I forgot to say so in my sentence. This has been corrected by user Metrics. – biohazard Jun 29 '13 at 19:43

5 Answers5

15

Here a data.table solution. The solution is general in the sense it will work even for a data.frame with 60 columns. Since I group the data by all variables different of value( See how I create keys below)

library(data.table)
dat <- read.table(text='name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
  name etc1 etc2 mm
1:    A    1    X 10
2:    B    1    Y  2
3:    C    1    Y 45

EDIT extend to more than one value variable

In case you have more than one numeric variables on which you want to compute the mean , For example, if your data look like this

  name value etc1 etc2     value1
1    A     9    1    X  2.1763485
2    A    10    1    X -0.7954326
3    A    11    1    X -0.5839844
4    B     2    1    Y -0.5188709
5    C    40    1    Y -0.8300233
6    C    50    1    Y -0.7787496

The above solution can be extended like this :

X[,lapply(.SD,mean),keys]
   name etc1 etc2 value     value1
1:    A    1    X    10  0.2656438
2:    B    1    Y     2 -0.5188709
3:    C    1    Y    45 -0.8043865

This will compute the mean for all variables that don't exist in keys list.

agstudy
  • 119,832
  • 17
  • 199
  • 261
12

You can use aggregate() function like below:

aggregate(df$value,by=list(name=df$name,etc1=df$etc1,etc2=df$etc2),data=df,FUN=mean)
talat
  • 68,970
  • 21
  • 126
  • 157
Homa Ghiasi
  • 131
  • 1
  • 5
  • 1
    You don't need to use `df$` inside aggregate because it has a data argument where you specify the data source (df in this case). There is also a formula interface for aggregate that may be easier to read. – talat Feb 17 '15 at 14:16
2

The code (written by Metrics) is almost working except in one place (.name). I slightly modified it:

sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

sample.m <- ddply(sample, 'name', summarize, value =mean(value), etc1=head(etc1,1), etc2=head(etc2,1))

sample.m
      name value etc1 etc2
    1    A    10    1    X
    2    B     2    1    Y
    3    C    45    1    Y
S Das
  • 3,291
  • 6
  • 26
  • 41
  • Many thanks! Would you happen to know if there is a shortcut I can use so that I don't have to enter the names of all the other columns? There are actually many more than in the example I gave. – biohazard Jun 29 '13 at 19:52
  • @agstudy I didn't know this either. Thanks for pointing this out. – biohazard Jun 29 '13 at 19:52
1

Assuming your dataframe is df.

install.packages("plyr")
library(plyr)



df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))

df.m
 name value etc1 etc2
1    A      10    1    X
2    B       2    1    Y
3    C      45    1    Y
Metrics
  • 15,172
  • 7
  • 54
  • 83
  • Thanks a lot! There is one caveat though. In the real data set I have about 60 other columns that do not change over the duplicates and should be left intact, do I need to enter them one by one as arguments of `ddply()` or is there a shortcut technique? – biohazard Jun 29 '13 at 19:50
0

This simple one worked for me:

avg_data <- aggregate( . ~ name, df, mean)

Using the "aggregate" function: apply the formula method ( x ~ y ) for all variables (.) based on the naming variable ("name"), within the data.frame "df", to perform the "mean" function.

liza
  • 1