4

I would like to calculate a sample mean in R subject to a specific criterion. For example, given the table below, I want the mean of wage_accepted only for rows where stage is 1 or 2:

treatment session period stage wage_accepted type 
1            1      1     1            25  low 
1            1      1     3            19  low 
1            1      1     3            15  low 
1            1      1     2            32 high 
1            1      1     2            13  low 
1            1      1     2            14  low 
1            1      2     1            17  low 
1            1      2     4            16  low
1            1      2     5            21  low

The desired output in this case should be:

   stage  mean
      1  21.0 
      2  19.6667

Thanks in advance.

rado

4 Answers

4

With the dplyr library

library(dplyr)

df %>% filter(stage==1 | stage ==2) %>% group_by(stage) %>%
  summarise(mean=mean(wage_accepted))

If you are new to dplyr a bit of explanation:

Take the data frame df, then keep only the rows where stage equals 1 or 2. Then, for each value of stage, calculate the mean of wage_accepted.
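To try the pipeline on the sample table from the question, here is a minimal sketch. It assumes the nine rows shown are rebuilt by hand into a data frame called `df` (only the columns needed here), and it uses `%in%` in place of the chained `==` comparisons:

```r
library(dplyr)

# Sample data copied from the question (assumption: rebuilt by hand)
df <- data.frame(
  stage         = c(1, 3, 3, 2, 2, 2, 1, 4, 5),
  wage_accepted = c(25, 19, 15, 32, 13, 14, 17, 16, 21),
  type          = c("low", "low", "low", "high", "low", "low", "low", "low", "low")
)

# Keep stages 1 and 2, then average wage_accepted within each stage
result <- df %>%
  filter(stage %in% c(1, 2)) %>%
  group_by(stage) %>%
  summarise(mean = mean(wage_accepted))

print(result)
```

The same pattern should work for character values, for example `filter(type %in% c("low", "high"))`.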

dimitris_ps
  • Thanks, it's useful. However, my real data is much bigger and the above is just an example. I would like to select 25 of the 50 values that a variable takes. In this case (filter stage==1 | .... | stage == 25) would be a little bit long. How can I do it more efficiently? – rado Apr 19 '15 at 00:24
  • Use `filter(stage %in% 1:25)` – dimitris_ps Apr 19 '15 at 00:25
  • The variable is qualitative, not quantitative. The values are, for example, 'A', 'B', 'C' and so on... – rado Apr 19 '15 at 00:27
  • 1
    Yeap, you got the logic! – dimitris_ps Apr 19 '15 at 00:29
3

Assuming you have a csv file for the data, you can read data into a data frame using:

data <- read.csv("PATH_TO_YOUR_CSV_FILE/Name_of_the_CSV_File.csv")

Then you can use either this code relying on sapply():

sapply(split(data$wage_accepted, data$stage), mean)

       1        2        3        4        5 
21.00000 19.66667 17.00000 16.00000 21.00000 

Or this code relying on tapply():

tapply(data$wage_accepted, data$stage, mean)

       1        2        3        4        5 
21.00000 19.66667 17.00000 16.00000 21.00000 
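Both calls above return the mean for every stage. To restrict the result to stages 1 and 2 only, as the question asks, one option is to subset before averaging. A minimal base R sketch, assuming the column names from the question's table and a hand-built data frame in place of the CSV file:

```r
# Sample data copied from the question (assumption: rebuilt by hand)
data <- data.frame(
  stage         = c(1, 3, 3, 2, 2, 2, 1, 4, 5),
  wage_accepted = c(25, 19, 15, 32, 13, 14, 17, 16, 21)
)

# Logical index of the rows whose stage is 1 or 2
keep <- data$stage %in% c(1, 2)

# Mean of wage_accepted per stage, computed on the kept rows only
means <- tapply(data$wage_accepted[keep], data$stage[keep], mean)

print(means)
```

Subsetting first also avoids computing means for groups that are discarded afterwards.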
bgfriend0
Gaurav Sharma
2

Check this out. It's a toy example, but data.table is so compact. dplyr is great as well obviously.


    library(data.table)

    dat <- data.table(iris)
    dat[Species == "setosa" | Species == "virginica", mean(Sepal.Width), by = Species]

As for your need for speed, data.table is a rocket ship; look it up. I'll leave it to you to apply this to your question. Best, M2K
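For reference, one way the iris example might translate to the question's data is sketched below. The table `dt` is rebuilt by hand from the rows in the question and the result column name `mean_wage` is an illustrative choice, not something from the original post:

```r
library(data.table)

# Sample data copied from the question (assumption: rebuilt by hand)
dt <- data.table(
  stage         = c(1, 3, 3, 2, 2, 2, 1, 4, 5),
  wage_accepted = c(25, 19, 15, 32, 13, 14, 17, 16, 21)
)

# i: keep stages 1 and 2; j: compute the mean; by: one row per stage
res <- dt[stage %in% c(1, 2), .(mean_wage = mean(wage_accepted)), by = stage]

print(res)
```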

miles2know
0

You can compute the mean for every stage first and then filter for the stages you need:

# Mean of wage_accepted for each stage
df = do.call(rbind, lapply(split(data, f = data$stage), function(x) {
  data.frame(stage = unique(x$stage), mean = mean(x$wage_accepted))
}))

# Keep only the means for stages 1 and 2
required = subset(df, stage %in% c(1, 2))
Veerendra Gadekar