0

I want to define a new column of a dataframe as a function of all the other rows that have a particular value in a particular column.

For example:

mtcars

I want the difference between the mpg of each car and the average mpg of all the cars with the same cyl. It's something like the code below, but obviously the second mtcars$cyl would need to be different!

mtcars$dif_mpg = mtcars$mpg - mean(mtcars[mtcars$cyl == mtcars$cyl, ]$mpg)
user2374133
  • 193
  • 1
  • 11

2 Answers2

1

Something like this should do the job (in base R):

transform(mtcars, dif_mpg=mpg-ave(mpg, cyl, FUN=mean))

ave computes FUN on subgroups of mpg defined by cyl. transform allows you to add/modify columns to a data frame, and also evaluates expressions in the context of the data frame (so you don't have to type out mtcars$mpg, etc.). Here are the first 6 rows of the result:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb     dif_mpg
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  1.25714286
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  1.25714286
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 -3.86363636
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  1.65714286
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  3.60000000
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 -1.64285714

Other alternatives include dplyr package (as shown by David Robinson), data.table:

library(data.table)
(data.table(mtcars, keep.rownames=T)[, dif_mpg:=mpg - mean(mpg), by=cyl])

And plyr (though you should use dplyr over plyr, as it is much faster):

library(plyr)
ddply(mtcars, "cyl", transform, dif_mpg=mpg-mean(mpg))
BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • is there a whole family of functions like ave? what if instead of averaging I wanted to sum? or get the standard deviation of, etc. – user2374133 Aug 25 '14 at 18:40
  • nvm. stupid question. ave has a function parameter – user2374133 Aug 25 '14 at 18:40
  • `ave` does not stand for `average`. `ave` applies a function of your choice to each subgroup. Notice `FUN=mean` part of the call. You can change `FUN` to be `sum`, `sd`, etc. – BrodieG Aug 25 '14 at 18:41
0

This kind of grouping operation is well handled by the dplyr package (which you would need to install first). In this case the solution would be:

library(dplyr)
mtcars <- mtcars %>% group_by(cyl) %>% mutate(dif_mpg=mpg - mean(mpg))
David Robinson
  • 77,383
  • 16
  • 167
  • 187
  • I noticed that neither `data.table` nor `ddply` were keeping row names as is the case with `transform`. For `data.table` there is a `keep.rownames` option. Is there something similar for `dplyr`? – BrodieG Aug 25 '14 at 18:46
  • @BrodieG: this is a good point. Hadley [intentionally designed dplyr](https://github.com/hadley/dplyr/issues/366) to discard rownames, so the only thing to do is add them as another column first. – David Robinson Aug 25 '14 at 18:48