How can I calculate proportional use with dplyr?

Question

I need to calculate proportional use of treatments within each trial. However, my code produces incorrect values. Also, I have a few trials where treatments were not used (e.g. Trial 7 - Control) that are populating with Nan. How can I input a value of 0 for trials that were not used. Output pictured below. Thanks

[![enter image description here][1]][1]

 filter(Use==1) %>%
 group_by(Trial,Treatment,Use, .drop=F) %>%
 summarise(trial_n=n()) %>%
 mutate(Prop = n / trial_n * 100)``` 


   `structure(list(DateTime = structure(c(1596994200, 1596994200, 
   1596994200, 1596994200, 1596994200, 1596994200, 1596994200, 
   1596994200, 
   1596995100, 1596995100), class = c("POSIXct", "POSIXt"), tzone = ""), 
   Treatment = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 
   1L), .Label = c("30%Shade", "60%Shade", "90%Shade", "Control"
   ), class = "factor"), Pyranometer = structure(c(NA_integer_, 
   NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
   NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("1", 
   "2", "3", "4"), class = "factor"), Irradiance = c(NA_real_, 
   NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
   NA_real_, NA_real_, NA_real_), Total_Solar_Flux = c(NA_real_, 
   NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
   NA_real_, NA_real_, NA_real_), Date = structure(c(18483, 
   18483, 18483, 18483, 18483, 18483, 18483, 18483, 18483, 18483
   ), class = "Date"), Time = c("12:30:00", "12:30:00", "12:30:00", 
   "12:30:00", "12:30:00", "12:30:00", "12:30:00", "12:30:00", 
   "12:45:00", "12:45:00"), Year = structure(c(1L, 1L, 1L, 1L, 
   1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "2"), class = "factor"), 
   Trial = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
   ), .Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9", 
   "10", "11", "12", "13", "14"), class = "factor"), Id = c("Dianne", 
   "June", "Dianne", "June", "Dianne", "June", "Dianne", "June", 
   "Dianne", "June"), Ambient_F = c(95.8, 95.8, 95.8, 95.8, 
   95.8, 95.8, 95.8, 95.8, 96.6, 96.6), BGtemp_F = c(112.5, 
   112.5, 106.2, 106.2, 99.7, 99.7, 118.4, 118.4, 113.4, 113.4
   ), Use = c(1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), Ambient_C = 
   c(35.44, 
   35.44, 35.44, 35.44, 35.44, 35.44, 35.44, 35.44, 35.89, 35.89
   ), BGtemp_C = c(44.72, 44.72, 41.22, 41.22, 37.61, 37.61, 
   48, 48, 45.22, 45.22)), row.names = c(NA, 10L), class = 
   "data.frame")`


 [1]: https://i.stack.imgur.com/gP3Nk.png

I think you're looking for `dplyr::group_by(.drop=F)`. See [here](https://stackoverflow.com/a/54757520) — Dan Adams, Jan 29 '22 at 00:52
@Dan Adams, ```.drop=F``` worked! However, I can't get my proportions to add to anything more than 1 now. — JLD475, Jan 31 '22 at 15:31
Um... isn't that the definition of a fraction? The formula `x[i]/sum(x)` can't exceed 1 as long as all your values are non-negative real numbers. — Dan Adams, Jan 31 '22 at 15:40
The post should have said 'all rows in the Prop column now have a value of 1'. I have tried using ```dput(head(mydataframe, 10)) ``` to add a reproducible example, but I keep being told I have code that is not formatted properly. — JLD475, Jan 31 '22 at 15:50
So the `Prop` is supposed to add up to `100` for each trial? — Dan Adams, Jan 31 '22 at 17:14

score 1 · Accepted Answer · answered Jan 31 '22 at 17:24

First you can group by Use, Trial and Treatment to get a separate count for each, then dplyr::summarize() as in your attempts. However, then you need to regroup_by() so that your Prop calculation occurs per-Trial. When sum(trial_n) == 0 then the result will be NaN because of zero division. The final step uses dplyr::if_else() to convert these values to 0 if that's what you need.

library(tidyverse)

d %>% 
  group_by(Trial, Use, Treatment, .drop=F) %>%
  summarise(trial_n = n()) %>%
  group_by(Trial, Use) %>% 
  mutate(Prop = trial_n/sum(trial_n) * 100) %>% 
  mutate(Prop = if_else(is.nan(Prop), 0, Prop))
#> `summarise()` has grouped output by 'Trial', 'Use'. You can override using the `.groups` argument.
#> # A tibble: 60 x 5
#> # Groups:   Trial, Use [15]
#>    Trial   Use Treatment trial_n  Prop
#>    <fct> <int> <fct>       <int> <dbl>
#>  1 1         0 30%Shade        3  37.5
#>  2 1         0 60%Shade        2  25  
#>  3 1         0 90%Shade        1  12.5
#>  4 1         0 Control         2  25  
#>  5 1         1 30%Shade        1  50  
#>  6 1         1 60%Shade        0   0  
#>  7 1         1 90%Shade        1  50  
#>  8 1         1 Control         0   0  
#>  9 2        NA 30%Shade        0   0  
#> 10 2        NA 60%Shade        0   0  
#> # ... with 50 more rows

^{Created on 2022-01-31 by the reprex package (v2.0.1)}

Data:

d <- structure(
  list(
    DateTime = structure(
      c(
        1596994200,
        1596994200,
        1596994200,
        1596994200,
        1596994200,
        1596994200,
        1596994200,
        1596994200,
        1596995100,
        1596995100
      ),
      class = c("POSIXct", "POSIXt"),
      tzone = ""
    ),
    Treatment = structure(
      c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 1L,
        1L),
      .Label = c("30%Shade", "60%Shade", "90%Shade", "Control"),
      class = "factor"
    ),
    Pyranometer = structure(
      c(
        NA_integer_,
        NA_integer_,
        NA_integer_,
        NA_integer_,
        NA_integer_,
        NA_integer_,
        NA_integer_,
        NA_integer_,
        NA_integer_,
        NA_integer_
      ),
      .Label = c("1",
                 "2", "3", "4"),
      class = "factor"
    ),
    Irradiance = c(
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_
    ),
    Total_Solar_Flux = c(
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_,
      NA_real_
    ),
    Date = structure(
      c(
        18483,
        18483,
        18483,
        18483,
        18483,
        18483,
        18483,
        18483,
        18483,
        18483
      ),
      class = "Date"
    ),
    Time = c(
      "12:30:00",
      "12:30:00",
      "12:30:00",
      "12:30:00",
      "12:30:00",
      "12:30:00",
      "12:30:00",
      "12:30:00",
      "12:45:00",
      "12:45:00"
    ),
    Year = structure(
      c(1L, 1L, 1L, 1L,
        1L, 1L, 1L, 1L, 1L, 1L),
      .Label = c("1", "2"),
      class = "factor"
    ),
    Trial = structure(
      c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
      .Label = c(
        "1",
        "2",
        "3",
        "4",
        "5",
        "6",
        "7",
        "8",
        "9",
        "10",
        "11",
        "12",
        "13",
        "14"
      ),
      class = "factor"
    ),
    Id = c(
      "Dianne",
      "June",
      "Dianne",
      "June",
      "Dianne",
      "June",
      "Dianne",
      "June",
      "Dianne",
      "June"
    ),
    Ambient_F = c(95.8, 95.8, 95.8, 95.8,
                  95.8, 95.8, 95.8, 95.8, 96.6, 96.6),
    BGtemp_F = c(112.5,
                 112.5, 106.2, 106.2, 99.7, 99.7, 118.4, 118.4, 113.4, 113.4),
    Use = c(1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
    Ambient_C =
      c(
        35.44,
        35.44,
        35.44,
        35.44,
        35.44,
        35.44,
        35.44,
        35.44,
        35.89,
        35.89
      ),
    BGtemp_C = c(44.72, 44.72, 41.22, 41.22, 37.61, 37.61,
                 48, 48, 45.22, 45.22)
  ),
  row.names = c(NA, 10L),
  class =
    "data.frame"
)

score 0 · Answer 2 · answered Jan 28 '22 at 20:24

0

Would you be so kind and put a subset of your data, please? like dput(head(MyDataFrame, 10)). I will help. If I understand mutate and group_by shall be sufficient.

You can:

  group_by(Year) |> 
  mutate(year_n = sum(n()))
  ungroup()

To get the total number in year. Then

  group_by(Year,Trial,Treatment,Use) %>%
  mutate(Prop = ifelse(Use == 0, 0, 100 * n() / year_n)) %>%
  [...]

answered Jan 28 '22 at 20:24

Grzegorz Sapijaszko

1,913
1
5
12

I appreciate your help. After running ```dput(head(captive_final,10)``` do I copy and paste the console output to here? It seems to be a large amount of information. – JLD475 Jan 28 '22 at 21:14
1

Edit your question and add it there. It's ok if it's a bit long. – Dan Adams Jan 28 '22 at 23:16

How can I calculate proportional use with dplyr?

2 Answers2

Data: