Using ddply to summarise data in R

Question

I have defined a function 'average' and am using it in ddply:

average <- function (parameter,speed) {
  sequence = seq(min(speed), max(speed), by=4.5)
  interval = cut(speed, sequence)
  avg = tapply(parameter, interval, mean)
  avg
}


df <- ddply(data1, c(unique('class'), unique('PrecVehClass')), summarise,
            avg.spacing = average(spacing, velocity),
            avg.headway = average(headway, velocity),
            avg.speed = average(velocity, velocity))

As you can see, the average function creates intervals using 'cut' and then finds the average within each interval. I also want to display the intervals in my output. Currently I get the following output:

> head(df)
  class PrecVehClass avg.spacing avg.headway avg.speed
1     1            1      129.10        2.50     51.80
2     1            1       91.80        1.62     56.79
3     1            2       25.65     6744.06      2.55
4     1            2       31.86       45.23      7.18
5     1            2       35.43        3.25     11.63
6     1            2       38.45        2.85     16.21

How can I add a new column that displays the interval (i.e. its minimum and maximum values, e.g. [31.8,36.2]) in each row?

EDIT

The following are the first 6 rows of my data set:

> dput(head(data1))
structure(list(vehicle = c(2L, 2L, 2L, 2L, 2L, 2L), frame = 43:48, 
    globalx = c(6451214.156, 6451216.824, 6451219.616, 6451222.548, 
    6451225.462, 6451228.376), class = c(2L, 2L, 2L, 2L, 2L, 
    2L), velocity = c(37.76, 37.9, 38.05, 38.18, 38.32, 38.44
    ), acceleration = c(10.44, 9.3, 4.36, -0.73, -1.15, 1.9), 
    lane = c(2L, 2L, 2L, 2L, 2L, 2L), precedingveh = c(0L, 0L, 
    0L, 0L, 0L, 0L), followingveh = c(13L, 13L, 13L, 13L, 13L, 
    13L), spacing = c(0, 0, 0, 0, 0, 0), headway = c(0, 0, 0, 
    0, 0, 0), u = c("no", "no", "no", "no", "no", "no"), PrecVehClass = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    )), .Names = c("vehicle", "frame", "globalx", "class", "velocity", 
"acceleration", "lane", "precedingveh", "followingveh", "spacing", 
"headway", "u", "PrecVehClass"), row.names = c(NA, 6L), class = "data.frame")

You can see the average function I defined above. In addition to the average values in the output, I want to add a new column that displays the 'interval' for which each average was found. If I don't use ddply but use tapply for, say, avg.spacing, I get the following output:

p <- tapply(data1$spacing, cut(data1$velocity, seq(min(data1$velocity), max(data1$velocity), by=4.5)), mean)

> p
  (0,4.5]   (4.5,9]  (9,13.5] (13.5,18] (18,22.5] (22.5,27] (27,31.5] (31.5,36] (36,40.5] (40.5,45] (45,49.5] (49.5,54] 
 29.52244  37.44980  44.09410  50.19250  56.89366  61.90450  67.21415  72.83281  79.73360  88.38050  96.87901 105.47172 
(54,58.5] (58.5,63] (63,67.5] (67.5,72] (72,76.5] (76.5,81] (81,85.5] (85.5,90] 
116.13763 120.46700 126.49401 136.43546 174.28593 271.90232 255.20733        NA

In the above output you can see that the interval is reported along with the average value of spacing in that interval. I want to get this interval output in my final table like this:

> head(df)
  class PrecVehClass avg.spacing avg.headway avg.speed  interval
1     1            1      129.10        2.50     51.80   (0,4.5]
2     1            1       91.80        1.62     56.79   (4.5,9]
3     1            2       25.65     6744.06      2.55   (0,4.5]
4     1            2       31.86       45.23      7.18   (4.5,9]
5     1            2       35.43        3.25     11.63  (9,13.5]
6     1            2       38.45        2.85     16.21 (13.5,18]

I don't know how to specify this in the 'average' function or in the ddply command. Please help.

It is much easier to help if you provide a [minimal, reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) — Henrik, Feb 25 '14 at 21:56

Gregor Thomas · Accepted Answer · 2014-02-26T00:14:23.977

Generally the point of bundling commands up into a function is so you don't have to worry about the intermediate steps. You've done that, but now you want the intermediate results too (your "interval"). I think the only good solution is to take your function apart.

Defining interval first, you can just use it as a grouping variable in ddply and use plain old mean, unless I'm misunderstanding the purpose of your average function.

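library(plyr)

# df here is the raw data frame (data1 in the question)
# Bin velocity into 4.5-wide intervals and store the bin as a column
df$interval <- with(df, cut(velocity, seq(min(velocity), max(velocity), by = 4.5)))
# Group by interval as well, so plain mean() is computed within each bin
df <- ddply(df, c("class", "PrecVehClass", "interval"), summarise,
            avg.spacing = mean(spacing),
            avg.headway = mean(headway),
            avg.speed = mean(velocity))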

Note also the grouping variables in ddply: you don't need the unique() wrapper, since the column names are passed as plain strings.
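To see why the unique() wrapper adds nothing, here is a small base-R check. The variable names vars_wrapped and vars_plain are made up for this illustration:

```r
# unique() applied to a one-element character vector returns it unchanged
identical(unique("class"), "class")

# So both spellings build the same vector of grouping-column names
vars_wrapped <- c(unique("class"), unique("PrecVehClass"))
vars_plain   <- c("class", "PrecVehClass")
identical(vars_wrapped, vars_plain)
```

Both calls to identical() return TRUE, so ddply receives exactly the same grouping specification either way.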

A ddply example:

df1 <- data.frame(x = rnorm(100))
df1$interval <- cut(df1$x, breaks=c(-10, -1, 1, 10))
ddply(df1, "interval", summarize, mean_within_interval = mean(x))
  interval mean_within_interval
1 (-10,-1]           -1.5262258
2   (-1,1]            0.0880585
3   (1,10]            1.4796220
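One thing that may be worth checking when the breaks come from seq(min(x), max(x), by = ...): cut() uses left-open intervals by default, so the minimum value itself falls outside the first bin, and anything beyond the last break is not binned either. A minimal sketch with made-up speeds (the values and the names default_bins and closed_bins are assumptions for illustration, not taken from the question's data):

```r
# Hypothetical speeds chosen so the boundary cases are easy to see
velocity <- c(0, 3, 8, 10)

# Same construction as in the question: breaks from min to max in steps of 4.5
breaks <- seq(min(velocity), max(velocity), by = 4.5)   # 0.0, 4.5, 9.0

# Default intervals are left-open, so the minimum (0) gets NA;
# 10 lies beyond the last break (9) and also gets NA
default_bins <- cut(velocity, breaks)

# include.lowest = TRUE closes the first interval and keeps the minimum
closed_bins <- cut(velocity, breaks, include.lowest = TRUE)

print(default_bins)
print(closed_bins)
```

Rows whose interval is NA will not land in any of the labelled bins, so it can help to inspect table(is.na(...)) on the interval column before summarising.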

Thanks, but this doesn't work for me. The purpose of my 'average' function is to get the mean over a defined interval. Your code calculates mean over all values — umair durrani, Feb 26 '14 at 00:03
@umairdurrani I corrected an error (I had "speed" where I should have had "velocity"), but the code does work. In the second argument of `ddply`, it specifies `"interval"` as a grouping variable. Thus the means will be calculated for each unique value of interval. — Gregor Thomas, Feb 26 '14 at 00:09
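
The point made in the comment above, that grouping on "interval" yields one mean per interval rather than a mean over all values, can be checked against the tapply approach from the question. This is only a sketch on a toy data frame; the data values, the fixed breaks, and the names toy, by_ddply and by_tapply are assumptions for illustration:

```r
library(plyr)

# Toy data (hypothetical values, not from the question's data set)
toy <- data.frame(
  velocity = c(1, 2, 5, 6, 7, 10, 12),
  spacing  = c(10, 20, 30, 40, 50, 60, 70)
)

# Same binning idea as in the answer, with fixed breaks for readability
toy$interval <- cut(toy$velocity, breaks = c(0, 4.5, 9, 13.5))

# Grouped means via ddply: one row per interval
by_ddply <- ddply(toy, "interval", summarise, avg.spacing = mean(spacing))

# Grouped means via tapply, as in the question
by_tapply <- tapply(toy$spacing, toy$interval, mean)

print(by_ddply)
print(by_tapply)

# The two approaches agree interval by interval
all.equal(by_ddply$avg.spacing, as.numeric(by_tapply))
```

The ddply result carries the interval as an ordinary column, which is what the question asks for, while tapply only keeps it as the names of the result.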
