How to find percentage of a sub-group using dplyr

Question

I have a list of cities and within them a certain number of both open and closed stores. I want to use dplyr to find the percentage of closed stores for each city. Store.status is either "A" if active/open or "I" if inactive/closed.

So far I have:

df %>% 
  group_by(City,Store.Status) %>% 
  summarise(Num_stores=n(),
  Percent_closed = ??? / Num_stores) %>%   
  arrange(desc(Num_stores))

Output:

City    Store Status        Num_store           Percent_closed
--------------------------------------------------------------
Chicago        O                 88                (???/88)
Denver         O                 78                (???/78)
New York       O                 56                (???/56)

I'm getting back just the total number of stores in each city along with an "0" in the store status even though it's the sum of all open and closed stores. I don't know how to find the percentage of closed stores in each city, which would simply be the total number of closed stores divided by the total stores in the city. I know this is really simple but I'm new to R/coding in general and can't find anything that will help.

Ideally, the output would be :

City      Num_store       Percent_closed
--------------------------------------------------------------
Ackley       3                (1/3)
Adair        2                (0/2)

Where Num_store is the total number of stores, active or inactive. Percent_closed is the total number inactive("I" in Store.Status) over the total number of stores.

For example, there's two Ackley stores with an "A" and one with "I" so 1 inactive / 3 total.

> dput(head(iowa, 20))
structure(list(Store = c(2656L, 2657L, 2674L, 2835L, 2954L, 3013L, 
3041L, 3045L, 3162L, 3354L, 3385L, 3886L, 3738L, 3740L, 3831L, 
3833L, 3838L, 3872L, 3873L, 3879L), Name = c("Hy-Vee Food Store / Corning", 
"Hy-Vee Food Store / Bedford", "Hy-Vee Food Store / Lamoni", 
"CVS Pharmacy #8538 / Cedar Falls", "Dahl's / Ingersoll", "Keith's Foods", 
"Shugar's Super Valu / Colfax", "Britt Food Center", "Nash Finch / Wholesale Food", 
"Sam's Club 8238 / Davenport", "Sam's Club 8162 / Cedar Rapids", 
"Wal-Mart 0646 / Anamosa", "Hartig Drug Company #4 / Dubuque", 
"Hometown Foods / Waterloo", "The Market Of Madrid", "Wal-Mart 3394 / Atlantic", 
"Schnucks / Bettendorf", "Target Store T-0086 / Dubuque", "Target Store T-1113 / Coralville", 
"Target Store T-1800 / Sioux City"), Store.Status = c("A", "A", 
"A", "A", "I", "A", "A", "A", "I", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A"), Address = c("300  10th St", "1604 Bent", 
"720 East Main", "2302 West First St", "3425 Ingersoll", "207 E Locust St", 
"28 E Howard", "8 2nd St NW", "807 Grandview", "3845 Elmore Ave.", 
"2605 Blairs Ferry Rd NE", "101  115 St", "2225 Central Ave", 
"1010 E Mitchell Ave", "301 Annex Rd", "1905 East 7th St", "858 Middle Rd", 
"3500 Dodge St", "1441 Coral Ridge Ave", "5775 Sunnybrook Dr"
), City = c("Corning", "Bedford", "Lamoni", "Cedar Falls", "Des Moines", 
"Bloomfield", "Colfax", "Britt", "Muscatine", "Davenport", "Cedar Rapids", 
"Anamosa", "Dubuque", "Waterloo", "Madrid", "Atlantic", "Bettendorf", 
"Dubuque", "Coralville", "Sioux City"), State = c("IA", "IA", 
"IA", "IA", "IA", "IA", "IA", "IA", "IA", "IA", "IA", "IA", "IA", 
"IA", "IA", "IA", "IA", "IA", "IA", "IA"), Zip.Code = c("51632", 
"50833", "50140", "50613", "50300", "52537", "50054", "50423", 
"52761", "52807", "52402", "52205", "52001", "50702", "50156", 
"50022", "52722", "52003", "52241", "51106"), Store.Address = c("300 10th St\nCorning, IA 51632\n(40.991861, -94.731809)", 
"1604 Bent\nBedford, IA 50833\n(40.676171, -94.725578)", "720 East Main\nLamoni, IA 50140\n(40.623647, -93.924475)", 
"2302 West First St\nCedar Falls, IA 50613\n(42.539874, -92.472778)", 
"3425 Ingersoll\nDes Moines, IA 50300\n(41.586313, -93.663337)", 
"207 E Locust St\nBloomfield, IA 52537\n(40.752691, -92.412847)", 
"28 E Howard\nColfax, IA 50054\n(41.677932, -93.244443)", "8 2nd St NW\nBritt, IA 50423\n(43.098696, -93.801917)", 
"807 Grandview\nMuscatine, IA 52761\n(41.408437, -91.064113)", 
"3845 Elmore Ave.\nDavenport, IA 52807\n(41.559731, -90.527081)", 
"2605 Blairs Ferry Rd NE\nCedar Rapids, IA 52402\n(42.031819, -91.67969)", 
"101 115 St\nAnamosa, IA 52205", "2225 Central Ave\nDubuque, IA 52001\n(42.51384, -90.671531)", 
"1010 E Mitchell Ave\nWaterloo, IA 50702\n(42.476639, -92.343118)", 
"301 Annex Rd\nMadrid, IA 50156\n(41.87894, -93.815365)", "1905 East 7th St\nAtlantic, IA 50022\n(41.403853, -94.98571)", 
"858 Middle Rd\nBettendorf, IA 52722\n(41.539421, -90.520013)", 
"3500 Dodge St\nDubuque, IA 52003\n(42.491944, -90.720589)", 
"1441 Coral Ridge Ave\nCoralville, IA 52241\n(41.691276, -91.608399)", 
"5775 Sunnybrook Dr\nSioux City, IA 51106\n(42.448939, -96.332446)"
), Report.Date = c("10/01/2018", "10/01/2018", "10/01/2018", 
"10/01/2018", "10/01/2018", "10/01/2018", "10/01/2018", "10/01/2018", 
"10/01/2018", "10/01/2018", "10/01/2018", "10/01/2018", "10/01/2018", 
"10/01/2018", "10/01/2018", "10/01/2018", "10/01/2018", "10/01/2018", 
"10/01/2018", "10/01/2018"), Inactive = c(FALSE, FALSE, FALSE, 
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)), row.names = c(NA, 
20L), class = "data.frame")

Welcome to Stack Overflow. You'll get better help if you [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including a small representative dataset in a plain text format - for example the output from `dput(df)`, if that is not too large. — neilfws, Apr 05 '23 at 01:08
Thanks for the edit, it's my first post I wasn't sure how to do it. — Cobra, Apr 05 '23 at 01:10
Strangely, I can't find a duplicate for what I think is the core of the question - do you essentially want to know that you can do `sum(!variable)` to count the number of `FALSE` values? This is close - https://stackoverflow.com/questions/40996611/which-indices-are-false - but not quite the same. — thelatemail, Apr 05 '23 at 01:51
Yeah I couldn't find anything that could help either. Maybe my explanation wasn't the best but I think the core of the question is what you're getting at. Still not sure how to use it in dplyr but I'll keep looking. — Cobra, Apr 05 '23 at 02:01

akrun · Answer 1 · 2023-04-05T05:59:30.130

The percentage closed can be created by constructing a logical vector and take the mean of the vector

library(dplyr)
df %>% 
  group_by(City,Store.Status) %>% 
  summarise(Num_stores=n(), .groups = "drop_last") %>%  
  mutate(Percent_closed =  mean(Store.Status == "C", na.rm = TRUE)
    ) %>%   
  ungroup %>%
  arrange(desc(Percent_closed))

-output

# A tibble: 19 × 4
   City         Store.Status Num_stores Percent_closed
   <chr>        <chr>             <int>          <dbl>
 1 Anamosa      A                     1              0
 2 Atlantic     A                     1              0
 3 Bedford      A                     1              0
 4 Bettendorf   A                     1              0
 5 Bloomfield   A                     1              0
 6 Britt        A                     1              0
 7 Cedar Falls  A                     1              0
 8 Cedar Rapids A                     1              0
 9 Colfax       A                     1              0
10 Coralville   A                     1              0
11 Corning      A                     1              0
12 Davenport    A                     1              0
13 Des Moines   I                     1              0
14 Dubuque      A                     2              0
15 Lamoni       A                     1              0
16 Madrid       A                     1              0
17 Muscatine    I                     1              0
18 Sioux City   A                     1              0
19 Waterloo     A                     1              0

Or if we want to use the OP's expression, the count can be the sum of logical expression or the length of the subset (length(Store.Status[Store.Status %in% "C"]))

df %>% 
  group_by(City,Store.Status) %>% 
  summarise(Num_stores=n(), .groups = "drop_last") %>%
  mutate(
  Percent_closed =  sum(Store.Status == "C", na.rm = TRUE)/Num_stores
   ) %>%   
  ungroup %>%
  arrange(desc(Percent_closed))

Didn't work :( Pretty much getting the same output I was before. I do think the logical vector is the right way to go though I'll keep trying little variations. — Cobra, Apr 05 '23 at 02:46
@BrettSimmons I guess I understand the reason. I didn't notice the group_by with Store.Status earlier. Can you try the update. It is a bit hard to test without any example data — akrun, Apr 05 '23 at 02:52
I added a screenshot of the data and clarified it a bit better I think. — Cobra, Apr 05 '23 at 03:12
@Cobra i would request to add dput of the data i.e. `dput(head(yourdata, 20))` instead of images as we cannot copy/paste and test from image — akrun, Apr 05 '23 at 04:03
@Cobra Based on the data provided, I get 0 values because there was no "C" value in the 'Store.Status' data you showed. I updated with the output I got — akrun, Apr 05 '23 at 06:00

How to find percentage of a sub-group using dplyr

1 Answers1