subsetting dataframe results in incorrect output

Question

I'm trying to create a subset of my data frame (df) by calculating the mean of a variable with repeated measurements (measured multiple times a day, over several weeks). This variable is called "consumption" in my df.

I followed the example in "Calculate mean of column data based on conditions in another column" and adapted the code to my df and my desired conditions.

However, I calculated a few of the means by hand (using Excel) and got completely different results.

Could someone point me in the right direction of where my code is going wrong?

I do have "0" as a few measurements, and they are important and need to be included when calculating the mean.

Here is a reproducible example:

df <- read.table("https://pastebin.com/raw/Zpa8cLBN", header = T)

library(dplyr)

df_mean <- df %>% group_by(treatment,day,Control) %>% summarise(
  consumption = first(consumption), consumption = last(consumption), consumption = mean(consumption[consumption >= 0]))

desired_results <- read.table("https://pastebin.com/raw/vZten0jd", header = T) # calculated manually in excel

When I compare the two, the results in the column "consumption", which should be the calculated mean, are not correct at all.

Thanks everyone

You need to use different variable names in your `summarise`, because here you are modifying `consumption` each time you call it — jlesuffleur, Jun 15 '20 at 10:14
Hello Thanks for the tip. I will post it as a response. I didn't realize using the same variable name would cause this issue. — Andy, Jun 15 '20 at 10:27

score 1 · Accepted Answer · answered Jun 15 '20 at 10:31

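It appears that I need to use variable names in the `summarise` function that differ from the column names in the original df. Because `summarise` evaluates its expressions in order, reusing the name `consumption` meant each later expression operated on the already summarised value rather than on the original column.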

df_mean <- df %>% group_by(treatment,day,Control) %>% summarise(
  Mean_consumption = first(consumption), Mean_consumption = last(consumption), Mean_consumption = mean(consumption[consumption >= 0]))

When cross-referenced with my desired_results, it's what I was looking for.
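
To make the cause visible, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from the pastebin files). It contrasts reusing the column name with using a new one:

```r
library(dplyr)

# Hypothetical toy data: one group with three repeated measurements.
toy <- data.frame(
  treatment   = c("A", "A", "A"),
  consumption = c(0, 4, 8)
)

# Reusing the column name: the second expression sees the value produced by
# the first one, so the "mean" is taken over the single first value.
reused <- toy %>%
  group_by(treatment) %>%
  summarise(
    consumption = first(consumption),
    consumption = mean(consumption[consumption >= 0])
  )

# Using a new name: the expression still sees the original column.
renamed <- toy %>%
  group_by(treatment) %>%
  summarise(Mean_consumption = mean(consumption[consumption >= 0]))

reused$consumption        # 0 (the first value, not the mean)
renamed$Mean_consumption  # 4
```

When the same new name is assigned several times in one `summarise` call, only the last assignment ends up in the output, so giving each summary its own name keeps all of them.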

Thanks @jlesuffleur

score 1 · Answer 2 · answered Jun 16 '20 at 00:09

We can use data.table

library(data.table)
setDT(df)[, .(
  Mean_consumption     = first(consumption),
  Mean_consumptionlast = last(consumption),
  Mean_consumptionfilt = mean(consumption[consumption >= 0])
), .(treatment, day, Control)]
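
As a quick check of the same pattern, here is a small sketch on made-up data (the `toy` table, its values, and the output column names are hypothetical). Each summary gets its own name, so all three are kept:

```r
library(data.table)

# Hypothetical toy data: two groups with repeated measurements.
toy <- data.table(
  treatment   = c("A", "A", "A", "B", "B"),
  consumption = c(0, 4, 8, 2, 6)
)

# One row per group, with distinct names for each summary.
res <- toy[, .(
  first_consumption = first(consumption),
  last_consumption  = last(consumption),
  mean_consumption  = mean(consumption[consumption >= 0])
), by = treatment]

res
```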

akrun

Hello Thanks for the comment. I appreciate the input – Andy Jun 16 '20 at 07:21