Calculate mean of dataset in R that satisfies two conditions

Question

I want to find the mean of the dataset grouped by year and book_Id in R.

I tried to use Tapply but I can only put one index condition in that function.

In SQL it would look something like

Select year, book_id, avg(users_read)
From
Where year = 2018 
Group by year, book_id

So my final table would like 

year  | book_id | avg(users_read)
2018.   1.         12
2018.   2.         8
2018.   3.         13

Possible duplicate of https://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame — akrun, Jul 25 '21 at 18:47
Or another duplicate https://stackoverflow.com/questions/38094244/filtering-within-the-summarise-function-of-dplyr — akrun, Jul 25 '21 at 20:35
Does this answer your question? [filtering within the summarise function of dplyr](https://stackoverflow.com/questions/38094244/filtering-within-the-summarise-function-of-dplyr) — TarJae, Jul 25 '21 at 20:43

Ronak Shah · Accepted Answer · 2021-07-24T15:46:15.280

2

Translation of the SQL code in R would be -

res <- aggregate(users_read~year + book_id, subset(df, year == 2018), mean)

Or in dplyr -

library(dplyr)

res <- df %>%
  filter( year == 2018) %>%
  group_by(book_id) %>%
  summarise(users_read = mean(users_read))

edited Jul 24 '21 at 15:46

answered Jul 24 '21 at 11:28

Ronak Shah

377,200
20
156
213

I want to group it by each by each book id – TAllen Jul 24 '21 at 15:34
Did you try the `aggregate` code? It groups by each book id already. Also you only have one book id here because you are filtering by it. `Where year = 2018 & book_id = 1` – Ronak Shah Jul 24 '21 at 15:38
yea I noticed the filter and removed it from my original post, also I just tried the aggregate code and I'm getting "no rows to aggregate" ( might be because I added the column to my table ? ) – TAllen Jul 24 '21 at 15:54
I have updated both the answer to group by `book_id`. Try the `dplyr` answer. – Ronak Shah Jul 24 '21 at 15:55
I got "Nan" , but there are numeric values in that column – TAllen Jul 24 '21 at 16:04
Have you tried using `mean(users_read, na,rm = TRUE)` to remove `NA` values? – Ronak Shah Jul 25 '21 at 00:19

score 1 · Answer 2 · answered Jul 25 '21 at 13:17

Thanks to the sqldf package you can use your SQL statement also in R:

sqldf::sqldf("
Select year, book_id, avg(users_read)
From df1
Where year = 2018 
Group by year, book_id
")

  year book_id avg(users_read)
1 2018       1            10.4
2 2018       2            15.5
3 2018       3             9.0

Reproducible Data

set.seed(123)
n <- 20
df1 <- data.frame(year = sample(2018:2019, n, TRUE),
                  book_id = sample(3, n, TRUE),
                  users_read = sample(c(1:(n-1), NA), n))

Note that column users_read includes a NA value.

df1

   year book_id users_read
1  2018       1          9
2  2018       1         NA
3  2018       1         10
4  2019       1          7
5  2018       3          5
6  2019       2         11
7  2019       3          6
8  2019       2         19
9  2018       1          2
10 2018       2         16
11 2019       3          8
12 2019       2         12
13 2019       1          1
14 2018       3         18
15 2019       3          3
16 2018       1         17
17 2019       3         13
18 2018       2         15
19 2018       1         14
20 2018       3          4

Calculate mean of dataset in R that satisfies two conditions

2 Answers2

Reproducible Data