2

I'm new to R (having used MATLAB before) and have tried for a long time to search for a solution for this but I can't find one for this (seemingly) quite easy problem. Here is the problem;

In the first column I have my time values over a couple of days (simplified in this example) and in the second column I have the values over which I want to average. What I want to do is to take all the values that belong to the same time and average over that. I am doing this over rather large data sets so doing this automatically would help greatly.

Let's set-up:

time = rep( c("00:00", "00:10", "00:20", "00:30", "00:40", "00:50", "01:00", "01:10"), 5)
values = c(sample(1:100, 40))
data = cbind(time, values)

So now I have my matrix with time & values in it and I would like to group all the values that have (eg) "00:00" and calculate the mean of this. After some searching I found out the aggregate() function could help nicely, so I did the following;

aggregate(as.numeric(data[,-1]), by = list(sort(data[,1])), mean) 

which has output

    Group.1    x
1   00:00 77.2
2   00:10 59.2
3   00:20 51.0
4   00:30 49.4
5   00:40 51.4
6   00:50 33.4
7   01:00 33.8
8   01:10 51.6

So it seems to work nicely but when I calculate it by hand the mean of the values are all different. (For instance; for 00:00: (56+3+91+71+8)/5 = 45.8 and NOT 77.2), can anyone tell me what I am doing wrong?

Tim.Lucas
  • 271
  • 2
  • 11

3 Answers3

2

@joran's suggestion (don't scramble the by variable by sorting it) seems to work:

set.seed(101) ## for reproducibility
time = rep( c("00:00", "00:10", "00:20", "00:30", 
      "00:40", "00:50", "01:00", "01:10"), 5)
values = c(sample(1:100, 40))
data = cbind(time, values)
aggregate(as.numeric(data[,2]),by=list(factor(data[,1])), mean)
##   Group.1    x
## 1   00:00 50.0
## 2   00:10 29.0
## 3   00:20 45.0
## 4   00:30 60.2
## 5   00:40 48.8
## 6   00:50 57.2
## 7   01:00 37.2
## 8   01:10 56.2
##

Check the first group:

mean(as.numeric(data[data[,1]=="00:00","values"]))
## [1] 50

As a further recommendation, I would strongly suggest using data.frame rather than cbind()ing your columns -- this allows you to put times and numeric values together without getting them all coerced to the same type.

(It would be nice to use a built-in times object too: I tried times from the chron package but didn't quite get the hang of it)

dat <- data.frame(time,values)  ## avoid using "data" as a variable name
aggregate(values~time, data=dat, mean)

is much easier to read.

By the way, there are a lot of posts on Stack Overflow comparing various solutions for aggregation (by, aggregate, ddply and friends from the plyr package, and the data.table package): e.g. Elegant way to solve ddply task with aggregate (hoping for better performance) , R: speeding up "group by" operations , How to speed up summarise and ddply? ...

Community
  • 1
  • 1
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thanks so much! I guess I wasn't sure what to look for, those other questions would have helped too! Thanks for the suggestions, will apply this! – Tim.Lucas Nov 29 '12 at 08:03
1

by is your friend:

by(as.numeric(data[,"values"]),data[,"time"],mean)
Stephan Kolassa
  • 7,953
  • 2
  • 28
  • 48
0

I'd suggest setting the index variable (time) as a factor using as.factor().

Then use that as an index ie: aggregate(data$values,by=list(data$time.factor),FUN=mean)

chris
  • 1
  • 2
    Time doesn't have to be a factor for this to work. Also, see the aggregate formula notation. – John Nov 28 '12 at 14:41
  • I tried that but it still doesn't give the right awnser, it is closer though `aggregate(as.numeric(data[,2]),by=list(factor(data[,1])), mean)` – Tim.Lucas Nov 28 '12 at 15:23