Summary means using dplyr

Question

I'm trying to produce a table with summary totals and means across the whole dataset, and then by sub-category (f_grp), and show this by site.

I can use the group_by function to group by, which works well for reporting total_count and Mean_per_litre, but I would then like the same values for each category, as shown in f_grp.

|Site |total_count |Mean_per_litre

|1 |66 |3.33333333
|2 |77 |4.27777778
|3 |65 |3.38541667
|4 |154 |8.85057471
etc

I've tried group_by for both site and f_grp but this isn't quite right

|site |f_grp |total_count |mean_per_litre

|1 |1c |3 |1.666667
|1 |1d |15 |4.166667
|1 |2a |1 |1.666667
|1 |2b |47 |11.190476

This isn't quite right as its not easy to read and I've now lost the original total columns I had in the first table (sorry about the tables, cant get them to work here).

dat$site=as.factor(dat$site)
dat$count=as.numeric(dat$count)
dat$f_grp=as.factor(dat$f_grp)
  
 

# totals across all f_grp
tabl1 <- dat %>%
  group_by(site) %>%
  summarise (total_count = sum(count), Mean_per_litre = mean(count_l_site))
tabl1

# totals FG 1b
tabl2 <- dat %>%
  group_by(site) %>%
  filter(f_grp== '1b') %>%
  summarise ('1b_total_count' = sum(count))
tabl2

### BUT - this doesnt give a correct mean, as it only shows the mean of '1b' when only '1b' is present. I need a mean over the entire dataset at that site.


# table showing totals across whole dataset
tabl7 <- dat %>%
  summarise (total_count = sum(count, na.rm = TRUE), Total_mean_per_litre = mean(count_l_site, na.rm = TRUE))
tabl7

# table with means for each site by fg

table6 <- dat %>%
  group_by(site, f_grp) %>%
  summarise (total_count = sum(count), mean_per_litre = mean(count_l_site, na.rm = TRUE))

table6

Ideally I need a way to extract the f-grp categories, put them as column headings, and then summarise means by site for those categories. But filtering the data and then joining multiple tables, gives incorrect means (as not mean of whole dataset, but a subset of that category, ie: when f_grp value is present only).

Many thanks to all who have read this far :)

> dput(head(dat))
structure(list(X = 1:6, site = structure(c(1L, 10L, 11L, 12L, 
13L, 14L), levels = c("1", "2", "3", "4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18"), class = "factor"), 
    count = c(0, 0, 0, 0, 0, 0), f_grp = structure(c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    ), levels = c("1b", "1c", "1d", "2a", "2b"), class = "factor"), 
    count_l_site = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L
), class = "data.frame")

Updated:

Following advice here from Jon, and using the mtcars data (which worked as expected) I've tried the same method using my own data.

I can almost produce what's needed, but the totals are coming through as a row when they are needed as a column.

tabl1 <- dat %>%
  group_by(site) %>%
  summarise (total_count = sum(count), Mean_per_litre = mean(count_l_site)) %>%
  mutate(fg = "total")

tabl1

tabl2_fg <- dat %>%
  group_by(site, f_grp = as.character(f_grp)) %>%
  summarize(total_count = sum(count), Mean_per_litre = mean(count_l_site))

tabl2_fg

tabl4 <-
  bind_rows(tabl1, tabl2_fg) %>%
  arrange(site, f_grp) %>%
  tidyr::pivot_wider(names_from = f_grp, values_from = c(Mean_per_litre, total_count), names_vary = "slowest")

tabl4

Output as follows

Next steps: move the circled outputs and put them at the beginning of the table remove every other line result - left with a simple table rows = sites; columns: total count; total mean; then columns for each fg count & mean: eg 1c count; 1c mean; 1d count; 1d mean.

I'd like to share some data as I know that helps provide some context to the Question, but I've no idea how to do that. Happy to hear suggestions, thanks. Currently in a .csv file. — Bramble95, Dec 07 '22 at 18:06
Here's the canonical answer on sharing example data for R: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. Short answer, it's often easiest to run `dput(head(dat))` and paste the output into your question. — Jon Spring, Dec 07 '22 at 18:13
It might also be simpler to use a standard R dataset, like `mtcars`, where people are more likely to be already familiar with the structure. In the case of mtcars, you might frame your question as "what is the average mpg within each gear group, and also what is the average mpg within each gear-cyl group?" — Jon Spring, Dec 07 '22 at 18:19
Thanks. It looks like in this case that data isn't very representative -- there are X + site numbers but other columns all 0 or NA. — Jon Spring, Dec 07 '22 at 18:28
Does my answer below (using other data) get at the structure of your question? If not, can you pls clarify? — Jon Spring, Dec 07 '22 at 18:29
Yes there are lots of zeros, and where there is a zero under 'count' or 'count_l_site', the f_grp value will be an NA — Bramble95, Dec 07 '22 at 18:29
`f_grp` is defined in your detail. Your totals should also have a `f_grp` value, but you are instead defining `fg`. That's why your detail has NA for `fg` in the detail and you have `NA` columns from the totals. I expect it should work if you make it consistent. — Jon Spring, Dec 08 '22 at 19:24

Jon Spring · Accepted Answer · 2022-12-07T18:32:15.350

Something like this?

library(dplyr)

avg_gear <- mtcars %>%
  group_by(gear) %>%
  summarize(avg_mpg = mean(mpg), n = n()) %>%
  mutate(cyl = "total")

avg_gear_cyl <- mtcars %>%
  group_by(gear,cyl = as.character(cyl)) %>%
  summarize(avg_mpg = mean(mpg), n = n())

bind_rows(avg_gear, avg_gear_cyl) %>%
  arrange(gear, cyl)

# A tibble: 11 × 4
    gear avg_mpg     n cyl  
   <dbl>   <dbl> <int> <chr>
 1     3    21.5     1 4    
 2     3    19.8     2 6    
 3     3    15.0    12 8    
 4     3    16.1    15 total
 5     4    26.9     8 4    
 6     4    19.8     4 6    
 7     4    24.5    12 total
 8     5    28.2     2 4    
 9     5    19.7     1 6    
10     5    15.4     2 8    
11     5    21.4     5 total

Or if you want categories as columns:

bind_rows(avg_gear, avg_gear_cyl) %>%
  arrange(gear, desc(cyl)) %>%
  tidyr::pivot_wider(names_from = cyl, values_from = c(avg_mpg, n), names_vary = "slowest")

# A tibble: 3 × 9
   gear avg_mpg_total n_total avg_mpg_8   n_8 avg_mpg_6   n_6 avg_mpg_4   n_4
  <dbl>         <dbl>   <int>     <dbl> <int>     <dbl> <int>     <dbl> <int>
1     3          16.1      15      15.0    12      19.8     2      21.5     1
2     4          24.5      12      NA      NA      19.8     4      26.9     8
3     5          21.4       5      15.4     2      19.7     1      28.2     2

Thanks Jon. I need to go away and work out what you've done. I think the last categories as columns could work. Thanks, I will come back here once I've tried it. — Bramble95, Dec 07 '22 at 18:37
Making some progress, but not quite right yet. I can't seem to get the total values on the same line in the way that you can with the mt_cars example. I'll amend the original Q with the code and table output — Bramble95, Dec 08 '22 at 15:21

Summary means using dplyr

1 Answers1