Adding a column of means by group to original data

Question

I want to add a column of means based on factor column in R data.frame. Like this:

df1 <- data.frame(X = rep(x = LETTERS[1:2], each = 3), Y = 1:6)
df2 <- aggregate(data = df1, Y ~ X, FUN = mean)
df3 <- merge(x = df1, y = df2, by = "X", suffixes = c(".Old",".New"))
df3
#   X Y.Old Y.New
# 1 A     1     2
# 2 A     2     2
# 3 A     3     2
# 4 B     4     5
# 5 B     5     5
# 6 B     6     5

To accomplish this problem I've to create two unnecessary data.frames. I'd like to know a way to append a column of means by factor column into my original data.frame without creating any extra data.frames. Thanks for your time and help.

Jaap · Answer 1 · 2022-03-08T17:53:00.550

29

Two alternative ways of doing this:

1) with the dplyr package:

library(dplyr)
df1 <- df1 %>% 
  group_by(X) %>% 
  mutate(Y.new = mean(Y))

2) with the data.table package:

library(data.table)
setDT(df1)[, Y.new := mean(Y), by = X]

both give the following result:

> df1
   X Y Y.new
1: A 1     2
2: A 2     2
3: A 3     2
4: B 4     5
5: B 5     5
6: B 6     5

edited Mar 08 '22 at 17:53

answered Dec 08 '15 at 13:56

Jaap

81,064
34
182
193

Awesome answer. It is not so obvious that you want to `group_by` followed by `mutate` in the `dplyr` answer so this taught me about it. – InfiniteFlash Apr 07 '20 at 21:00
If some rows have NAs, they will also have the mean attached to them. Do you think there's an easy way to have the with NAs, have NA returned to them in the mean column? – Sazid Ahmad Nov 07 '22 at 00:41
1

@SazidAhmad for the data.table method: `setDT(df1)[, Y.new := mean(Y, na.rm = TRUE), by = X][is.na(Y), Y.new := NA]` – Jaap Nov 07 '22 at 13:32
1

@SazidAhmad or in a more direct way: `setDT(df1)[, Y.new := mean(Y, na.rm = TRUE) * (NA^is.na(Y)), by = X]` – Jaap Nov 07 '22 at 13:33
Thank you! the first method was successful! Do you know how to do this using the dplyr package? Only asking because I am very used to dplyr. In any case, this has been very helpful! – Sazid Ahmad Nov 16 '22 at 20:35
@SazidAhmad The 2nd method can be used inside `mutate` from `dplyr` as well: `mutate(Y.new = mean(Y, na.rm = TRUE) * (NA^is.na(Y)))`; another possibility with the `replace`-function: `mutate(Y.new = mean(Y, na.rm = TRUE), Y.new = replace(Y.new, is.na(Y), NA))` – Jaap Nov 17 '22 at 11:58

score 25 · Accepted Answer · answered Nov 02 '11 at 05:37

25

This is what the ave function is for.

df1$Y.New <- ave(df1$Y, df1$X)

answered Nov 02 '11 at 05:37

John

23,360
7
57
83

score 8 · Answer 3 · edited May 27 '16 at 08:32

8

ddply and transform to the rescue (although I'm sure you'll get at least 4 different ways to do this):

library(plyr)
ddply(df1,.(X),transform,Y.New = mean(Y))
  X Y Y.New
1 A 1     2
2 A 2     2
3 A 3     2
4 B 4     5
5 B 5     5
6 B 6     5

edited May 27 '16 at 08:32

David Arenburg

91,361
17
137
196

answered Nov 02 '11 at 05:08

joran

169,992
32
429
468

Tyler Rinker · Answer 4 · 2011-11-02T19:51:43.943

Joran answered beautifully, This is not an answer to your question but an extension of the conversation. If you're looking for table of means for two categorical variable's relationship to a dependent here's the Hadley function for that:

cast(CO2, Type ~ Treatment, value="uptake", fun.aggregate=mean, margins=TRUE)

Here's a head view of CO2 data, and a look at the means table:

> head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2

> library(reshape)

> cast(CO2, Type ~ Treatment, mean, margins=TRUE)  
         Type nonchilled  chilled    (all)
1      Quebec   35.33333 31.75238 33.54286
2 Mississippi   25.95238 15.81429 20.88333
3       (all)   30.64286 23.78333 27.21310

Adding a column of means by group to original data

4 Answers4

Linked

Related