0

I am looking to calculate proportions from count data based on multiple variables. In the example dataset below, I want to know the proportion of each score for each species at each date. For example, for all MCAP on 2019-09-16, what proportion are 2's?

structure(list(date = structure(c(18155, 18155, 18155, 18155, 
18155, 18155, 18155, 18155, 18171, 18171, 18171, 18185, 18185, 
18185, 18185, 18185, 18185, 18185, 18185, 18185), class = "Date"), 
species = c("MCAP", "MCAP", "PCOM", "MCAP", "MCAP", "MCAP", 
"PCOM", "PCOM", "PCOM", "PCOM", "PCOM", "MCAP", "MCAP", "MCAP", 
"MCAP", "PCOM", "PCOM", "PCOM", "PCOM", "PCOM"), score = c(2, 
2, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3)), row.names = c(1L, 
2L, 3L, 4L, 8L, 33L, 37L, 38L, 7912L, 7931L, 7947L, 8543L, 8544L, 
8545L, 8547L, 12946L, 12947L, 12948L, 12949L, 12950L), class = "data.frame")
Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
  • Could you include what has already failed? Is this what you need: `with(df, table(species,score))`? – NelsonGon Apr 16 '20 at 14:12
  • Try this? `df %>% group_by(species, score, date) %>% summarise(N = n()/ sum(n()))` – NelsonGon Apr 16 '20 at 14:19
  • @NelsonGon, that doesn't work because it's still grouping by the score, so all the proportions are going to be 1. You need to group by all to get the counts, then only group by the species and date to get the proportion. – Matt Dzievit Apr 16 '20 at 14:22
  • @Matt D your approach worked by first grouping by counts and then by species/date. Thank you –  Apr 16 '20 at 14:30
  • @MattD I don't quite get it. Could you add some detailed explanation in your answer for my future self? – NelsonGon Apr 16 '20 at 14:35
  • 1
    @NelsonGon - Let's use the example MCAP, 2019-09-16. Your answer is grouping by those first 2 instances and also the scores (so a group for score 2 and a group for score 3). Now your solutions calculates the number of rows that are in those groups and then divides by the sum of all the rows (sum(n()) This by definition is always going to be one because you grouped all the data meaning n() = sum(n()) df %>% group_by(species, score) %>% summarise(N = n()) - run this and you'll see it calculates the number of obs for each species and score. Compare to your code above. Hope this helps. – Matt Dzievit Apr 16 '20 at 14:52

2 Answers2

2

tidyverse, specifically dplyr package can help with this. I assume there's many ways using these packages to solve this problem, but this is the first one that came to mind.

library(tidyverse)

data <- structure(list(date = structure(c(18155, 18155, 18155, 18155, 
                              18155, 18155, 18155, 18155, 18171, 18171, 18171, 18185, 18185, 
                              18185, 18185, 18185, 18185, 18185, 18185, 18185), class = "Date"), 
           species = c("MCAP", "MCAP", "PCOM", "MCAP", "MCAP", "MCAP", 
                       "PCOM", "PCOM", "PCOM", "PCOM", "PCOM", "MCAP", "MCAP", "MCAP", 
                       "MCAP", "PCOM", "PCOM", "PCOM", "PCOM", "PCOM"), score = c(2, 
                                                                                  2, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3)), row.names = c(1L, 
                                                                                                                                                           2L, 3L, 4L, 8L, 33L, 37L, 38L, 7912L, 7931L, 7947L, 8543L, 8544L, 
                                                                                                                                                           8545L, 8547L, 12946L, 12947L, 12948L, 12949L, 12950L), class = "data.frame")

data_sum <- data %>%
  ##This groups and counts the occurrences
  group_by(species,date,score) %>% 
  tally() %>% 
  ungroup() %>% 
  ##this groups and calculates the proportion for the groups above without the score.
  group_by(species,date) %>% 
  mutate(prop = n/sum(n))

data_sum %>%
  filter(date == '2019-09-16',
         species == 'MCAP')

  species date       score     n  prop
  <chr>   <date>     <dbl> <int> <dbl>
1 MCAP    2019-09-16     2     3   0.6
2 MCAP    2019-09-16     3     2   0.4
Matt Dzievit
  • 527
  • 3
  • 10
  • @Gregor - it is correct. What I filtered above was the example he asked for. The 2019-09-16 date and the MCAP species. Your example above is species MCAP and 2019-10-16, so if you look at my table below you will see 50% proportion for each score and the counts that match what you said above. species date score n prop 1 MCAP 2019-09-16 2 3 0.6 2 MCAP 2019-09-16 3 2 0.4 3 MCAP 2019-10-16 2 2 0.5 4 MCAP 2019-10-16 3 2 0.5 – Matt Dzievit Apr 16 '20 at 14:29
1

In base R, we can do this with table and prop.table. You can adjust the margin argument to change what the denominator is for the proportions. The code is short and the result is relatively nice for a table display in the console, not so nice because it's not a data frame for incorporating into additional analysis. For that, I'd suggest the dplyr approach in the other answer.

with(d, prop.table(table(species, score, date), margin = c(1, 3)))
# , , date = 2019-09-16
# 
#        score
# species         2         3
#    MCAP 0.6000000 0.4000000
#    PCOM 0.3333333 0.6666667
# 
# , , date = 2019-10-02
# 
#        score
# species         2         3
#    MCAP                    
#    PCOM 0.3333333 0.6666667
# 
# , , date = 2019-10-16
# 
#        score
# species         2         3
#    MCAP 0.5000000 0.5000000
#    PCOM 0.0000000 1.0000000
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • What is the use of `margin` here? Is there a way to automate it(like get it from the data itself)? – NelsonGon Apr 16 '20 at 14:36
  • 1
    `margin` refers to the dimensions. In 2-dimensional table, margin 1 would be rows, and margin 2 would be columns, so if you do, e.g, `prop.table(table(d$species, d$score), margin = 1)` the rows (species) will add up to 1. If you did `margin = 2` the columns (score) will add up to 1. In the 3-dimensional table in this example, OP wants the proportion of *species* within a *date* to add up to 1. My `c(1, 3)` margin is because `species` was my first argument and `date` was my 3rd argument. – Gregor Thomas Apr 16 '20 at 17:01