R- Subtracting the mean of a group from each element of that group in a dataframe

Question

I am trying to merge a vector 'means' to a dataframe. My dataframe looks like this Data = growth

I first calculated all the means for the different groups (1 group = population + temperature + size + replicat) using this command:

means<-aggregate(TL ~ Population + Temperature + Replicat + Size + Measurement, data=growth, list=growth$Name, mean)

Then, I selected the means for Measurement 1 as follows as I am only interested in these means.

meansT0<-means[which(means$Measurement=="1"),]

Now, I would like to merge this vector of means values to my dataframe (=growth) so that the right mean of each group corresponds to the right part of the dataframe.

The goal is to then substrat the mean of each group (at Measurement 1) to each element of the dataframe based on its belonging group (and for all other Measurements except Measurement 1). Maybe there is no need to add the means column to the dataframe? Do you know any command to do that ?

[27.06.18] I made up this simplified dataframe, I hope this help understanding. So, what I want is to substrat, for each individual in the dataframe and for each measurement (here only Measurement 1 and Measurement 2, normally I have more), the mean of its belongig group at MEASUREMENT 1.

So, if I get the means by group (1 group= Population + Temperature + Measurement):

means<-aggregate(TL ~ Population + Temperature + Measurement, data=growth, list=growth$Name, mean)
means

I got these values of means (in this example) :

Population Temperature Measurement       TL
JUB          15           **1**           **12.00000**
JUB          20           **1**           **15.66667**
JUB          15           2           17.66667
JUB          20           2           18.66667
JUB          15           3           23.66667
JUB          20           3           24.33333

We are only interested by the means at MEASUREMENT 1. For each individual in the dataframe, I want to substrat the mean of its belonging group at Measurement 1: in this example (see dataframe with R command): -for the group JUB+15+Measurement 1, mean = 12 -for the group JUB+20+Measurement 1, mean = 15.66

growth<-data.frame(Population=c("JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB"), Measurement=c("1","1","1","1","1","1","2","2","2","2","2","2", "3", "3", "3", "3", "3", "3"),Temperature=c("15","15","15","20", "20", "20","15","15","15","20", "20", "20","15","15","15","20", "20", "20"),TL=c(11,12,13,15,18,14, 16,17,20,21,19,16, 25,22,24,26,24,23), New_TL=c("11-12", "12-12", "13-12", "15-15.66", "18-15.66", "14-15.66", "16-12", "17-12", "20-12", "21-15.66", "19-15.66", "16-15.66", "25-12", "22-12", "24-12", "26-15.66", "24-15.66", "23-15.66"))    
print(growth)

I hope with this, you can understand better what I am trying to do. I have a lot of data and if I have to do this manually, this will take me a lot of time and increase the risk of me putting mistakes.

Please set up a reproducible example (no screenshots) that also show desired result. — Parfait, Jun 26 '18 at 19:31

akrun · Answer 1 · 2018-06-27T15:48:35.587

2

Here is an option with tidyverse. After grouping by the group columns, use mutate_at specifying the columns of interest and get the difference of that column (.) with the mean of it.

library(tidyverse)
growth %>% 
       group_by(Population, Temperature, Replicat, Size, Measurement) %>% 
       mutate_at(vars(HL, TL), funs(MeanGroupDiff = . 
                  - mean(.[Measurement == 1])))

Using a reproducible example with mtcars dataset

data(mtcars)
mtcars %>%
   group_by(cyl, vs) %>% 
   mutate_at(vars(mpg, disp), funs(MeanGroupDiff = .- mean(.[am==1])))

edited Jun 27 '18 at 15:48

answered Jun 26 '18 at 20:01

akrun

874,273
37
540
662

Thank you for your reply. I realized my post was not clear enough and clarify my description above. For each individual of each group used to calculate means (population + temperature + size + replicate + measurement) in the dataframe, I want to substrat to each individual of each group (population + temperature + size + replicate + measurement) the mean of the group at measurement 1 (population + temperature + size + replicate, MEASUREMENT 1). – Marine Jun 27 '18 at 15:45
So in reality, it is not really the mean of its belonging group that I want to substrat, but the means calculated for groups at Measurement 1. This is to control for size at the beginning of the experiment. Hope I am clearer in my explanations. I set up an example above, that shows what I am trying to explain. Maybe this will be easier to understand. – Marine Jun 27 '18 at 15:46
@Marine Not sure it is clear to me. Can you check the updated solution – akrun Jun 27 '18 at 15:48
`funs` is soft-deprecated as of dplyr 0.8. The replacement is something like `mutate_at(vars(mpg, disp), .funs = list(MeanGroupDiff = .- mean(.[am==1])))` but when doing that I get `Error in check_names_df(i, x): object 'am' not found`. – Max Ghenis Mar 08 '20 at 14:36
1

@MaxGhenis It could be `list(MeanGroupDiff = ~ . - mean(.[am == 1]))` – akrun Mar 08 '20 at 17:14

score 1 · Answer 2 · answered Jun 26 '18 at 19:18

Have you considered using the data.table package? It is very well suited for doing these kind of grouping, filtering, joining, and aggregation operations you describe, and might save you a great deal of time in the long run.

The code below shows how a workflow similar to the one you described but based on the built in mtcars data set might look using data.table.

To be clear, there are also ways to do what you describe using base R as well as other packages like dplyr, just throwing out a suggestion based on what I have found the most useful for my personal work.

library(data.table)

## Convert mtcars to a data.table
## only include columns `mpg`, `cyl`, `am` and `gear` for brevity
DT <- as.data.table(mtcars)[, .(mpg, cyl,am, gear)]

## Take a subset where `cyl` is equal to 6
DT <- DT[cyl == 6]

## Calculate grouped mean based on `gear` and `am` as grouping variables
DT[,group_mpg_avg := mean(mpg), keyby = .(gear, am)]

## Calculate each row's difference from the group mean
DT[,mpg_diff_from_group := mpg - group_mpg_avg]

print(DT)

#     mpg cyl am gear group_mpg_avg mpg_diff_from_group
# 1: 21.4   6  0    3         19.75                1.65
# 2: 18.1   6  0    3         19.75               -1.65
# 3: 19.2   6  0    4         18.50                0.70
# 4: 17.8   6  0    4         18.50               -0.70
# 5: 21.0   6  1    4         21.00                0.00
# 6: 21.0   6  1    4         21.00                0.00
# 7: 19.7   6  1    5         19.70                0.00

Thank you for your reply. I realized my post was not clear enough and clarify my description above. For each individual in the dataframe, I want to substrat the mean of its belonging group calculated at MEASUREMENT 1. I set up an example above. I tried the code you gave to add the column of the means but it does not work for me. I got this error message: Error in `[.data.table`(growth2, , `:=`(means_groups, mean("TL")), keyby = .("Population", : The items in the 'by' or 'keyby' list are length (1,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (3365). — Marine, Jun 27 '18 at 15:22

Parfait · Answer 3 · 2018-06-27T16:09:11.607

1

Consider by to subset your data frame by factors (but leave out Measurement in order to compare group 1 and all other groups). Then, run an ifelse conditional logic calculation for needed columns. Since by will return a list of data frames, bind all outside with do.call():

df_list <- by(growth, growth[,c("Population", "Temperature")], function(sub) {
  # TL CORRECTION      
  sub$Correct_TL <- ifelse(sub$Measurement != 1, 
                           sub$TL - mean(subset(sub, Measurement == 1)$TL),
                           sub$TL)
  # ADD OTHER CORRECTIONS

  return(sub)  
})

final_df <- do.call(rbind, df_list)

Output (using posted data)

final_df

#    Population Measurement Temperature TL   New_TL Correct_TL
# 1         JUB           1          15 11    11-12 11.0000000
# 2         JUB           1          15 12    12-12 12.0000000
# 3         JUB           1          15 13    13-12 13.0000000
# 7         JUB           2          15 16    16-12  4.0000000
# 8         JUB           2          15 17    17-12  5.0000000
# 9         JUB           2          15 20    20-12  8.0000000
# 13        JUB           3          15 25    25-12 13.0000000
# 14        JUB           3          15 22    22-12 10.0000000
# 15        JUB           3          15 24    24-12 12.0000000
# 4         JUB           1          20 15 15-15.66 15.0000000
# 5         JUB           1          20 18 18-15.66 18.0000000
# 6         JUB           1          20 14 14-15.66 14.0000000
# 10        JUB           2          20 21 21-15.66  5.3333333
# 11        JUB           2          20 19 19-15.66  3.3333333
# 12        JUB           2          20 16 16-15.66  0.3333333
# 16        JUB           3          20 26 26-15.66 10.3333333
# 17        JUB           3          20 24 24-15.66  8.3333333
# 18        JUB           3          20 23 23-15.66  7.3333333

edited Jun 27 '18 at 16:09

answered Jun 26 '18 at 19:28

Parfait

104,375
17
94
125

Thank you for your reply. This command works. However, by doing this, I get a new column that is the difference between each element of the dataframe and of the mean of its group (that is in this case Population + Temperature + Replicat + Size + Measurement). What I need is to substrat the mean of each group at Measurement 1 (when I measured individuals for the first time) to each element of the dataframe, still based on their belonging group, but for Measurements 2, 3 and 4. Actually, this is to 'correct' for the TL (total length) at the beginning of the experiment. – Marine Jun 26 '18 at 21:05
I almost knew this was not entirely what you need and hence my comment above under your post. Please show and do not just tell with actual current data and desired result. See: [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Parfait Jun 26 '18 at 21:09
I tried to set up an example. I added it, explanations and codes in the description. I hope with this, you could help me to resolve the problem. – Marine Jun 27 '18 at 15:04
Thank you very much! This works perfectly fine. I could do it for the entire table. By chance, do you know how to do kind of the same but with some values ? Indeed, from another table, I need to substrat 3 different values (for 3 different populations) to each individual based on their population. For instance, for individuals from population JUB, I need to substract a certain value that is 12 for that population; for individuals from population ZHA, I need to substract 15 to each individual; for individuals from population SEB, I need to substrat 14 to each individual. – Marine Jun 28 '18 at 17:28
As for those conditional populations logic, add a line in the `by` call for nested `ifelse` to subtract by the specific group: `sub$Correct_TL <- ifelse(sub$Population[[1]] == "JUB", sub$Correct_TL - 12, ifelse(sub$Population[[1]] == "ZHA", sub$Correct_TL - 15, ...))` – Parfait Jun 28 '18 at 19:17
Thank you. I could do all of this. Sorry for the late answer, I could not look at this anymore before these last few days. I am gonna note StackOverflow :). Thank you for your precious help again. – Marine Jul 11 '18 at 23:54

R- Subtracting the mean of a group from each element of that group in a dataframe

3 Answers3