0

Possible Duplicate:
faster way to create variable that aggregates a column by id

So the thing is that i have following data loaded from a CSV file:

id      value2  value3
1.000   0.010   14        
1.000   0.019   15        
0.995   0.024   13        
0.995   0.031   20        
0.990   0.012   13        
.....

I want to calculate the mean/median etc. value of value2 and value3 in groups of id. Afterwards the plan was to be able to sort the result by either value2 or value3.

Is there a way to do such task?

Thanks in advance.

Community
  • 1
  • 1
JavaCake
  • 4,075
  • 14
  • 62
  • 125
  • 1
    The duplicate question might not be a perfect match but this is quite basic and has been asked _many_ times here before. `aggregate`, `tapply`, `split` along with `lapply`/`sapply` and/or `ddply` from the plyr package can all accomplish what you want to do. – Dason Oct 14 '12 at 20:57

3 Answers3

6

It's data.table Sunday! This will scale well for big data -- fast and efficient.

> library(data.table)
> DT <- as.data.table(df)
> DT[, list(val2=mean(value2), val3=mean(value3)), by=id]
      id   val2 val3
1: 1.000 0.0145 14.5
2: 0.995 0.0275 16.5
3: 0.990 0.0120 13.0
GSee
  • 48,880
  • 13
  • 125
  • 145
3
library(plyr)
result <- ddply(df, .(id), function(x) { 
    data.frame(mv2 = mean(x$value2), mv3 = mean(x$value3))
    })

# order by mean value2
arrange(result, mv2)
# and for value 3
arrange(result, mv3)
Maiasaura
  • 32,226
  • 27
  • 104
  • 108
0

Assuming you have the data in a data frame called df you can do the following:

sapply(split(df[-1], df$id), sapply, mean)
Backlin
  • 14,612
  • 2
  • 49
  • 81