Find average value of lines which have duplicated variable

Question

How do I take the average of `value` over the rows that share the same value in Var1?

Input:

    Var1            Var2            Var3    value   
1   hsa-let-7a-5p   hsa-let-7a-1    124G    15.1096198266
2   hsa-let-7a-5p   hsa-let-7a-2    124G    15.1100852974
3   hsa-let-7a-5p   hsa-let-7a-3    124G    15.1092706389
24  hsa-miR-125b-5p hsa-mir-125b-1  124G    7.785156036
25  hsa-miR-125b-5p hsa-mir-125b-2  124G    7.785156036

Output:

    Var1                Var3    value   
    hsa-let-7a-5p       124G    "Average of hsa-let-7a in Var2 in input"
    hsa-miR-125b-5p     124G    "Average of hsa-mir-125b in Var2 in input"

This question has been answered several times on SO. Try: `aggregate(df$value, by=list(df$Var1), mean)`. There are also lots of solutions using the `data.table` and `dplyr` packages. – Colonel Beauvel, Aug 24 '15 at 07:24
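For anyone who wants to try the `aggregate` suggestion on the sample data, here is a minimal base R sketch. It assumes the input is held in a data frame called `df` (the name and the way it is built here are assumptions; the values are copied from the question). The formula form is shown as well because it keeps Var3 in the result.

```r
# Sample data copied from the question; the name `df` is an assumption.
df <- data.frame(
  Var1  = c(rep("hsa-let-7a-5p", 3), rep("hsa-miR-125b-5p", 2)),
  Var2  = c("hsa-let-7a-1", "hsa-let-7a-2", "hsa-let-7a-3",
            "hsa-mir-125b-1", "hsa-mir-125b-2"),
  Var3  = "124G",
  value = c(15.1096198266, 15.1100852974, 15.1092706389,
            7.785156036, 7.785156036),
  stringsAsFactors = FALSE
)

# Form from the comment: the result columns are named Group.1 and x.
by_var1 <- aggregate(df$value, by = list(df$Var1), FUN = mean)

# Formula form: groups by Var1 and Var3, so Var3 is kept in the result.
avg <- aggregate(value ~ Var1 + Var3, data = df, FUN = mean)
print(avg)
```

Grouping by Var3 as well only makes sense if Var3 is constant within each Var1, as it is in the sample; otherwise each Var1/Var3 combination gets its own row.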

score 1 · Answer 1 · answered Aug 24 '15 at 07:25

You didn't say what value to use for Var3 in the summarised result, so I will assume the first Var3 in each group (it doesn't matter for your current sample, where all are the same).

library(dplyr)

# One row per Var1: keep the first Var3 and average value
newdf <- df %>%
  group_by(Var1) %>%
  summarize(Var3 = first(Var3),
            value = mean(value))

Output

> newdf
Source: local data frame [2 x 3]

             Var1 Var3     value
1   hsa-let-7a-5p 124G 15.109659
2 hsa-miR-125b-5p 124G  7.785156

Ricky

Instead of answering a question that has been asked 100 times, please consider telling the OP to search first, and provide the appropriate link, to avoid a 101st topic ... – Colonel Beauvel Aug 24 '15 at 07:28
Noted, it was just faster for me to answer than to search for the duplicate... I guess I can just ignore the question – Ricky Aug 26 '15 at 02:45

score 1 · Accepted Answer · answered Aug 24 '15 at 07:27

I would work with the plyr package here.

library(plyr)
# Group by Var1 and Var3, then average value within each group
df2 <- ddply(df, .(Var1, Var3), summarize, Avg = mean(value))

Inside `.()` you list the variables you want to group by and keep; you can then calculate the mean, sd, or any other summary of the remaining columns. With large datasets, however, plyr can get slow.

The dplyr package is supposed to perform better, but I don't really have any experience with it.
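For comparison, here is a rough dplyr sketch of the same grouping as the `ddply` call (group by Var1 and Var3, average `value`). It assumes dplyr is installed and rebuilds `df` from the question's sample; the name `df` is an assumption. It is meant as an illustration of the equivalent call, not as a benchmark of the two packages.

```r
library(dplyr)

# Sample data copied from the question; the name `df` is an assumption.
df <- data.frame(
  Var1  = c(rep("hsa-let-7a-5p", 3), rep("hsa-miR-125b-5p", 2)),
  Var2  = c("hsa-let-7a-1", "hsa-let-7a-2", "hsa-let-7a-3",
            "hsa-mir-125b-1", "hsa-mir-125b-2"),
  Var3  = "124G",
  value = c(15.1096198266, 15.1100852974, 15.1092706389,
            7.785156036, 7.785156036),
  stringsAsFactors = FALSE
)

# Group by both columns that should be kept, then average within each group.
df2 <- df %>%
  group_by(Var1, Var3) %>%
  summarise(Avg = mean(value)) %>%
  ungroup()  # drop any grouping left over after summarise()

print(df2)
```

Recent dplyr versions print a message about how the result is regrouped after `summarise()`; the trailing `ungroup()` makes the result a plain ungrouped table either way.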
