Counting Based on Values

Question

I have a dataset and would like to count 3 instances:

Percent of rows where Coins is greater than 20
Percent of rows where Coins is equal to 20
Percent of rows where Coins is less than 20

Here's a sample from the much larger dataset:

Plan   Year   Coins   Copay
 A     2018     20      10
 B     2014     15       5
 C     2012     30       0
 D     2017     30      10
 E     2018     5       10
 F     2018     20       0
 G     2018     20       0
 H     2016     20      10
 I     2014     10       3
 J     2017     20       7

Thus, I would like to have the following counts (based on conditions and sample dataset above)

20% (2 plans out of 10 meet the condition - C,D)
50% (5 plans out of 10 meet the condition - A,F,G,H,J)
30% (3 plans out of 10 meet the condition - B,E,I)

This is pretty close to your [previous question](https://stackoverflow.com/questions/49635569/r-counting-based-on-3-conditions), I'm confident you can take the guidance there and apply it here. What have you tried? — r2evans, Apr 05 '18 at 04:28

Gregor Thomas · Accepted Answer · 2018-04-05T05:29:26.337

I like cut for binning, and table for counting. prop.table turns the counts into proportions.

prop.table(table(cut(your_data$Coins, breaks = c(-Inf, 19.5, 20.5, Inf))))

This just gives you the proportions. You can set custom labels in cut, see the help page for details.

Using Ell's sample data:

df <- data.frame("coins" = c(20,15,30,30,5,20,20,20,10,20))
prop.table(table(cut(df$coins, breaks = c(-Inf, 19.5, 20.5, Inf))))
# (-Inf,19.5] (19.5,20.5] (20.5, Inf] 
#         0.3         0.5         0.2

You can add a * 100 if you want the results in percent rather than proportion.

akrun · Answer 2 · 2018-04-05T04:40:04.553

2

We can do this with map for three conditions

library(tidyverse)
map(c('>', "==", "<"), ~ df1 %>% 
               filter(get(.x)(Coins, 20)) %>%
                     pull(Plan))
#[[1]]
#[1] "C" "D"

#[[2]]
#[1] "A" "F" "G" "H" "J"

 #[[3]]
 #[1] "B" "E" "I"

If we need the proportions

map(c('>', "==", "<"), ~ df1 %>%
       filter(get(.x)(Coins, 20)) %>% 
       count(Plan) %>% 
       mutate(Prop = 100 *n/sum(n)) %>%
       select(-n))
#[[1]]
# A tibble: 2 x 2
#  Plan   Prop
#  <chr> <dbl>
#1 C      50.0
#2 D      50.0

#[[2]]
# A tibble: 5 x 2
#  Plan   Prop
#  <chr> <dbl>
#1 A      20.0
#2 F      20.0
#3 G      20.0
#4 H      20.0
#5 J      20.0

#[[3]]
# A tibble: 3 x 2
#  Plan   Prop
#  <chr> <dbl>
#1 B      33.3
#2 E      33.3
#3 I      33.3

If the OP intended for the full dataset grouping

df1 %>%
   group_by(grp = case_when(Coins < 20 ~ 'grp1', Coins ==20 ~ 'grp2', TRUE ~ 'grp3')) %>%
   summarise(Plan = toString(unique(Plan)), prop = n()) %>%
   mutate(prop = 100 * prop/sum(prop)) %>%
   ungroup %>%
   select(-grp) 
# A tibble: 3 x 2
#   Plan           prop
#   <chr>         <dbl>
#1 B, E, I        30.0
#2 A, F, G, H, J  50.0
#3 C, D           20.0

edited Apr 05 '18 at 04:40

answered Apr 05 '18 at 04:30

akrun

874,273
37
540
662

1

Oh that's nice! I was working on a much cruder `dplyr::summarise` solution. +1 – Maurits Evers Apr 05 '18 at 04:31
I just realized it in my answer, too, but the `Prop` in `[[3]]` of your second output should be `30` and not `33`. – hpesoj626 Apr 05 '18 at 05:24
I think we can do the same without needing `Plan = toString(unique(Plan)),` – dbo Apr 05 '18 at 05:26
@hpesoj626 That is because it is based on different grouping – akrun Apr 05 '18 at 07:31
@akrun can you elaborate or give example for me? – dbo Apr 06 '18 at 01:30

rg255 · Answer 3 · 2018-04-05T20:28:26.890

I'd use the length function as a very simple option

100*(length(df$coins[df$coins > 20]) /length(df$coins))
100*(length(df$coins[df$coins == 20])/length(df$coins))
100*(length(df$coins[df$coins < 20]) /length(df$coins))

Giving

> 100*(length(df$coins[df$coins > 20]) /length(df$coins))
[1] 20
> 100*(length(df$coins[df$coins == 20])/length(df$coins))
[1] 50
> 100*(length(df$coins[df$coins < 20]) /length(df$coins))
[1] 30

If you are doing this a lot you could wrap it into a function, which you could use for other columns (d) and/or values of interest (p)

perc <- function(d, p){
    return(c(
    100*(length(d[d>p]) /length(d)),
    100*(length(d[d==p])/length(d)),
    100*(length(d[d<p]) /length(d))))
    }

perc(df$coins, 20)
perc(df$coins, 90)
perc(df$copay, 10)

This is based off a reproducible data frame of

df <- data.frame("plan" = LETTERS[1:10], "coins" = c(20,15,30,30,5,20,20,20,10,20), "copay" = c(10,5,0,10,10,0,0,10,3,7))

Side note: Given the variety of answers you got, I was curious enough to compare the approaches used. I thought that was really great to see different peoples creative approaches!

Running 10,000 times on the provided data frame, there are some considerable differences in run speed (using code presented at the time of writing). Akrun and Hpesoj626's solutions took 37 and 40 seconds respectively, Gregor's was considerably quicker at 2.1 seconds, while mine ran in 0.61 seconds. Furthermore, if you wrap it in to a function as I suggested, it takes just 0.15 seconds for 10,000 runs.

Gregor's uses fewer characters thus is a shorter script, personally I think it's very elegant (though if you are doing this many times for different values or columns the function will be the shortest approach). My only concern would be how it handles continuous data - imagine coins could take the value 20.0000000000001 - you would then have to code it as something like ...-Inf, 19.99999999999, 20.0000000000001, Inf... In other words, you have to be very careful about how you implement it.

As Gregor has noted, my version would require some more modification if you were looking to have more intervals.

Interesting comparison. I would add that both your answer and akrun's would be painful to scale if OP wanted more bins. This is a very strong reason to prefer solutions with `cut` or `findInterval` that can take an arbitrarily long vector of breaks as input. In the continuous data case those methods would also be preferred [because equality testing isn't appropriate for non-integers](https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal). Any of the methods could easily be wrapped in a function, so that's not an important point for comparison. — Gregor Thomas, Apr 05 '18 at 13:54

hpesoj626 · Answer 4 · 2018-04-05T05:28:01.473

I think akrun has covered everything you are looking for. But inspired by Gregor's answer, you can also use findInterval, then you can do some of the magic that akrun did.

df1 <- df %>% mutate(Group = findInterval(Coins, c(20, 20.5)))
df1 <- df1 %>% left_join(df1 %>% 
                    group_by(Group) %>% 
                    summarise(n = n()) %>% 
                    mutate(Prop = n / sum(n) * 100)) %>%
  select(-Group, -n)
df1

#        Plan Year Coins Copay Prop
# 1     A 2018    20    10   50
# 2     B 2014    15     5   30
# 3     C 2012    30     0   20
# 4     D 2017    30    10   20
# 5     E 2018     5    10   30
# 6     F 2018    20     0   50
# 7     G 2018    20     0   50
# 8     H 2016    20    10   50
# 9     I 2014    10     3   30
# 10    J 2017    20     7   50

You can also split by the Prop and use enframe to arrive at the same list of Plans that comprise the Prop.

df1 %>% split(.$Prop) %>%
  enframe() %>% 
  mutate(Plan = map(value, ~toString(paste(.x$Plan)))) %>% 
  unnest(Plan) %>%
  select(-value) %>%
  rename(Prop = name) %>%
  select(Plan, Prop)

#   Plan          Prop 
#   <chr>         <chr>
# 1 C, D          20   
# 2 B, E, I       30   
# 3 A, F, G, H, J 50

Counting Based on Values

4 Answers4