average between duplicated rows in R

Question

I have a data frame df with rows that are duplicates for the names column but not for the values column:

name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y

I need to aggregate the duplicate names into one row, while calculating the mean over the values column. The expected output is as follows:

name    value   etc1    etc2
A       10      1       X
B       2       1       Y
C       45      1       Y

I have tried to use df[duplicated(df$name),] but of course this does not give me the mean over the duplicates. I would like to use aggregate(), but the problem is that the FUN part of this function will apply to all the other columns as well, and among other problems, it will not be able to compute char content. Since all the other columns have the same content over the "duplicates", I need them to be aggregated as is just like the name column. Any hints...?

Are your other columns `etcX` also guaranteed to be the same for rows with the same `name`? — Hong Ooi, Jun 29 '13 at 18:57
@HongOoi Yes I believe so, I filtered them in the previous step to get similar contents. — biohazard, Jun 29 '13 at 19:38
@agstudy Sorry, the expected output was already included in my post, but I forgot to say so in my sentence. This has been corrected by user Metrics. — biohazard, Jun 29 '13 at 19:43

agstudy · Accepted Answer · 2013-06-29T20:19:18.697

Here a data.table solution. The solution is general in the sense it will work even for a data.frame with 60 columns. Since I group the data by all variables different of value( See how I create keys below)

library(data.table)
dat <- read.table(text='name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
  name etc1 etc2 mm
1:    A    1    X 10
2:    B    1    Y  2
3:    C    1    Y 45

EDIT extend to more than one value variable

In case you have more than one numeric variables on which you want to compute the mean , For example, if your data look like this

  name value etc1 etc2     value1
1    A     9    1    X  2.1763485
2    A    10    1    X -0.7954326
3    A    11    1    X -0.5839844
4    B     2    1    Y -0.5188709
5    C    40    1    Y -0.8300233
6    C    50    1    Y -0.7787496

The above solution can be extended like this :

X[,lapply(.SD,mean),keys]
   name etc1 etc2 value     value1
1:    A    1    X    10  0.2656438
2:    B    1    Y     2 -0.5188709
3:    C    1    Y    45 -0.8043865

This will compute the mean for all variables that don't exist in keys list.

score 12 · Answer 2 · edited Feb 17 '15 at 14:14

12

You can use aggregate() function like below:

aggregate(df$value,by=list(name=df$name,etc1=df$etc1,etc2=df$etc2),data=df,FUN=mean)

edited Feb 17 '15 at 14:14

talat

68,970
21
126
157

answered Feb 17 '15 at 14:08

Homa Ghiasi

131
1
5

1

You don't need to use `df$` inside aggregate because it has a data argument where you specify the data source (df in this case). There is also a formula interface for aggregate that may be easier to read. – talat Feb 17 '15 at 14:16

score 2 · Answer 3 · answered Jun 29 '13 at 19:34

The code (written by Metrics) is almost working except in one place (.name). I slightly modified it:

sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

sample.m <- ddply(sample, 'name', summarize, value =mean(value), etc1=head(etc1,1), etc2=head(etc2,1))

sample.m
      name value etc1 etc2
    1    A    10    1    X
    2    B     2    1    Y
    3    C    45    1    Y

Many thanks! Would you happen to know if there is a shortcut I can use so that I don't have to enter the names of all the other columns? There are actually many more than in the example I gave. — biohazard, Jun 29 '13 at 19:52
@agstudy I didn't know this either. Thanks for pointing this out. — biohazard, Jun 29 '13 at 19:52

Metrics · Answer 4 · 2013-06-29T19:47:21.110

1

Assuming your dataframe is df.

install.packages("plyr")
library(plyr)



df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))

df.m
 name value etc1 etc2
1    A      10    1    X
2    B       2    1    Y
3    C      45    1    Y

edited Jun 29 '13 at 19:47

answered Jun 29 '13 at 18:58

Metrics

15,172
7
54
83

Thanks a lot! There is one caveat though. In the real data set I have about 60 other columns that do not change over the duplicates and should be left intact, do I need to enter them one by one as arguments of `ddply()` or is there a shortcut technique? – biohazard Jun 29 '13 at 19:50

score 0 · Answer 5 · answered Jan 10 '23 at 19:33

This simple one worked for me:

avg_data <- aggregate( . ~ name, df, mean)

Using the "aggregate" function: apply the formula method ( x ~ y ) for all variables (.) based on the naming variable ("name"), within the data.frame "df", to perform the "mean" function.

average between duplicated rows in R

5 Answers5

Linked