-1

I want to find the mean of the dataset grouped by year and book_Id in R.

I tried to use Tapply but I can only put one index condition in that function.

In SQL it would look something like

Select year, book_id, avg(users_read)
From
Where year = 2018 
Group by year, book_id

So my final table would like 

year  | book_id | avg(users_read)
2018.   1.         12
2018.   2.         8
2018.   3.         13
TAllen
  • 3
  • 3
  • 1
    Possible duplicate of https://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame – akrun Jul 25 '21 at 18:47
  • 1
    Or another duplicate https://stackoverflow.com/questions/38094244/filtering-within-the-summarise-function-of-dplyr – akrun Jul 25 '21 at 20:35
  • 2
    Does this answer your question? [filtering within the summarise function of dplyr](https://stackoverflow.com/questions/38094244/filtering-within-the-summarise-function-of-dplyr) – TarJae Jul 25 '21 at 20:43

2 Answers2

2

Translation of the SQL code in R would be -

res <- aggregate(users_read~year + book_id, subset(df, year == 2018), mean)

Or in dplyr -

library(dplyr)

res <- df %>%
  filter( year == 2018) %>%
  group_by(book_id) %>%
  summarise(users_read = mean(users_read))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • I want to group it by each by each book id – TAllen Jul 24 '21 at 15:34
  • Did you try the `aggregate` code? It groups by each book id already. Also you only have one book id here because you are filtering by it. `Where year = 2018 & book_id = 1` – Ronak Shah Jul 24 '21 at 15:38
  • yea I noticed the filter and removed it from my original post, also I just tried the aggregate code and I'm getting "no rows to aggregate" ( might be because I added the column to my table ? ) – TAllen Jul 24 '21 at 15:54
  • I have updated both the answer to group by `book_id`. Try the `dplyr` answer. – Ronak Shah Jul 24 '21 at 15:55
  • I got "Nan" , but there are numeric values in that column – TAllen Jul 24 '21 at 16:04
  • Have you tried using `mean(users_read, na,rm = TRUE)` to remove `NA` values? – Ronak Shah Jul 25 '21 at 00:19
1

Thanks to the package you can use your SQL statement also in R:

sqldf::sqldf("
Select year, book_id, avg(users_read)
From df1
Where year = 2018 
Group by year, book_id
")
  year book_id avg(users_read)
1 2018       1            10.4
2 2018       2            15.5
3 2018       3             9.0

Reproducible Data

set.seed(123)
n <- 20
df1 <- data.frame(year = sample(2018:2019, n, TRUE),
                  book_id = sample(3, n, TRUE),
                  users_read = sample(c(1:(n-1), NA), n))

Note that column users_read includes a NA value.

 df1
   year book_id users_read
1  2018       1          9
2  2018       1         NA
3  2018       1         10
4  2019       1          7
5  2018       3          5
6  2019       2         11
7  2019       3          6
8  2019       2         19
9  2018       1          2
10 2018       2         16
11 2019       3          8
12 2019       2         12
13 2019       1          1
14 2018       3         18
15 2019       3          3
16 2018       1         17
17 2019       3         13
18 2018       2         15
19 2018       1         14
20 2018       3          4
Uwe
  • 41,420
  • 11
  • 90
  • 134