25

I want to add a column of means based on factor column in R data.frame. Like this:

df1 <- data.frame(X = rep(x = LETTERS[1:2], each = 3), Y = 1:6)
df2 <- aggregate(data = df1, Y ~ X, FUN = mean)
df3 <- merge(x = df1, y = df2, by = "X", suffixes = c(".Old",".New"))
df3
#   X Y.Old Y.New
# 1 A     1     2
# 2 A     2     2
# 3 A     3     2
# 4 B     4     5
# 5 B     5     5
# 6 B     6     5

To accomplish this problem I've to create two unnecessary data.frames. I'd like to know a way to append a column of means by factor column into my original data.frame without creating any extra data.frames. Thanks for your time and help.

Jaap
  • 81,064
  • 34
  • 182
  • 193
MYaseen208
  • 22,666
  • 37
  • 165
  • 309

4 Answers4

29

Two alternative ways of doing this:

1) with the package:

library(dplyr)
df1 <- df1 %>% 
  group_by(X) %>% 
  mutate(Y.new = mean(Y))

2) with the package:

library(data.table)
setDT(df1)[, Y.new := mean(Y), by = X]

both give the following result:

> df1
   X Y Y.new
1: A 1     2
2: A 2     2
3: A 3     2
4: B 4     5
5: B 5     5
6: B 6     5
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Awesome answer. It is not so obvious that you want to `group_by` followed by `mutate` in the `dplyr` answer so this taught me about it. – InfiniteFlash Apr 07 '20 at 21:00
  • If some rows have NAs, they will also have the mean attached to them. Do you think there's an easy way to have the with NAs, have NA returned to them in the mean column? – Sazid Ahmad Nov 07 '22 at 00:41
  • 1
    @SazidAhmad for the data.table method: `setDT(df1)[, Y.new := mean(Y, na.rm = TRUE), by = X][is.na(Y), Y.new := NA]` – Jaap Nov 07 '22 at 13:32
  • 1
    @SazidAhmad or in a more direct way: `setDT(df1)[, Y.new := mean(Y, na.rm = TRUE) * (NA^is.na(Y)), by = X]` – Jaap Nov 07 '22 at 13:33
  • Thank you! the first method was successful! Do you know how to do this using the dplyr package? Only asking because I am very used to dplyr. In any case, this has been very helpful! – Sazid Ahmad Nov 16 '22 at 20:35
  • @SazidAhmad The 2nd method can be used inside `mutate` from `dplyr` as well: `mutate(Y.new = mean(Y, na.rm = TRUE) * (NA^is.na(Y)))`; another possibility with the `replace`-function: `mutate(Y.new = mean(Y, na.rm = TRUE), Y.new = replace(Y.new, is.na(Y), NA))` – Jaap Nov 17 '22 at 11:58
25

This is what the ave function is for.

df1$Y.New <- ave(df1$Y, df1$X)
John
  • 23,360
  • 7
  • 57
  • 83
8

ddply and transform to the rescue (although I'm sure you'll get at least 4 different ways to do this):

library(plyr)
ddply(df1,.(X),transform,Y.New = mean(Y))
  X Y Y.New
1 A 1     2
2 A 2     2
3 A 3     2
4 B 4     5
5 B 5     5
6 B 6     5
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
joran
  • 169,992
  • 32
  • 429
  • 468
4

Joran answered beautifully, This is not an answer to your question but an extension of the conversation. If you're looking for table of means for two categorical variable's relationship to a dependent here's the Hadley function for that:

cast(CO2, Type ~ Treatment, value="uptake", fun.aggregate=mean, margins=TRUE)

Here's a head view of CO2 data, and a look at the means table:

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2

> library(reshape)

> cast(CO2, Type ~ Treatment, mean, margins=TRUE)  
         Type nonchilled  chilled    (all)
1      Quebec   35.33333 31.75238 33.54286
2 Mississippi   25.95238 15.81429 20.88333
3       (all)   30.64286 23.78333 27.21310
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519