I have the following dataset. Its variables are: whether a person used their phone (a dummy variable, 1 = used the phone ("Yes"), 0 = did not ("No")), the person's ID, and the district and sub-district they live in. Note that the same person may have been recorded two or more times under different sub-districts. However, I want to count such a person only once, that is, consider only unique IDs.

district sub_district   id  used_phone
    A   SX  1   Yes
    A   SX  2   Yes
    A   SX  3   No
    A   SX  4   No
    A   SY  4   No
    A   SY  5   Yes
    A   SZ  6   Yes
    A   SX  6   Yes
    A   SZ  7   No
    B   RX  8   No
    B   RV  9   No
    B   RX  9   No
    B   RV  10  Yes
    B   RV  11  Yes
    B   RT  12  Yes
    B   RT  13  Yes
    B   RV  13  Yes
    B   RT  14  No
    B   RX  14  No
  

N.B: used_phone is a factor variable

For the above dataset, I want to plot a distribution of "whether a person used a phone" for which I was using the following code:

  ggplot(df, aes(x=used_phone)) +
  geom_bar(color = "black", fill = "aquamarine4", position = "dodge") +
  labs(x="Used phone", y = "Number of people") +
  ggtitle("Whether person used phone") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
  

This code works fine. However, I want to do two things:

  1. Add % labels for each group (Yes and No) above the respective bars, while the y-axis still shows the count
  2. Plot the graph so that it only considers the unique IDs

Looking forward to solving this with your help, as I am a novice in R.

Thanks, Rachita

Rachita
  • Could you please include a minimal subset of your data as a dataframe object? Maybe use `dput(df)`. This allows potential solutions to be tested and verified. Have a look at [mre]. – Peter Jun 22 '20 at 17:08
  • Thanks for the suggestion, Peter! I have updated the dataset. Unfortunately, I can not post the original dataset hence, made one up for your review. Hope this is fine. – Rachita Jun 22 '20 at 17:20
  • Does this answer your question? [Adding percentage labels to a bar chart in ggplot2](https://stackoverflow.com/questions/40249943/adding-percentage-labels-to-a-bar-chart-in-ggplot2) – 4redwood Jun 22 '20 at 17:52
  • As for the unique ids, look into using something like `df[!duplicated(df$id),]` – 4redwood Jun 22 '20 at 18:00
  • Hi, @4redwood: the said link was not helpful in my case. However, thanks for pointing it out! – Rachita Jun 23 '20 at 09:26

2 Answers

1

Here is one suggestion that could work:

  1. Summarize your df by used_phone and count the number of people who have and have not used the phone.
  2. From the summarized counts, calculate each group's percent share, then add a label column, which is just the percent share with a % sign.
  3. Plot the new summarized df with ggplot. Use geom_text() to add the percentage labels at the top of the bars, and adjust the vjust argument in position_stack() to fine-tune the labels' position.
library(dplyr)
library(ggplot2)

df %>% 
  distinct(id, .keep_all = TRUE) %>%   # keep one row per unique id
  group_by(used_phone) %>% 
  summarize(count = n()) %>%           # number of people per answer
  mutate(share = count / sum(count),
         label = paste0(round(share * 100, 2), '%')) -> df

  ggplot(df, aes(y=count, x=used_phone)) +
  geom_bar(stat='identity',
           color = "black", 
           fill = "aquamarine4", 
           position = "dodge") +
  geom_text(aes(label = label),
            position = position_stack(vjust = 1.02),
            size = 3) +
  labs(title = 'Whether person used phone',
       x = 'Used Phone',
       y = 'Number of People') +
  theme_bw()
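
The de-duplication step depends on which columns are passed to distinct(). The sketch below illustrates this with four rows copied from the question's data (district omitted for brevity); the object names toy, all_cols and by_id are only illustrative.

```r
library(dplyr)

# Four rows from the question's data: id 4 appears under two sub-districts
toy <- data.frame(
  sub_district = c("SX", "SX", "SY", "SY"),
  id           = c(3, 4, 4, 5),
  used_phone   = c("No", "No", "No", "Yes")
)

# No columns named: rows are compared on every column,
# so both rows for id 4 survive (their sub_district differs)
all_cols <- toy %>% distinct(.keep_all = TRUE)

# id named: only the first row for each id is kept
by_id <- toy %>% distinct(id, .keep_all = TRUE)

nrow(all_cols)  # 4
nrow(by_id)     # 3
```

Calling distinct() without naming a column only removes rows that are identical in every column, so a person recorded under two sub-districts is still counted twice.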

Plot

monte
  • The original DF had 19 rows, all with either *yes* or *no* in `used_phone`. So there can not be exact 50% usage rate for both. And @Rachita wanted to count only unique id's. – MarBlo Jun 23 '20 at 04:23
  • I created this answer before the question was updated with new data, thanks for pointing out the distinct condition, I have modified the code to include that. – monte Jun 23 '20 at 04:58
  • Thanks, both!! The code works fine except the distinct condition - it is still counting the number of yes/no as per the non-unique IDs. I tried both of your versions. Any idea to work around with this? – Rachita Jun 23 '20 at 08:40
1

The duplicate IDs belong to people recorded under more than one sub_district, and you do not want to double count them, so I drop the sub_district variable. I then remove all duplicate IDs, count phone use, and calculate the percentages. The resulting DF is shown below. The plot uses geom_col, with the y-axis formatted as percentages via the scales package.

I have commented out two lines of code which, when uncommented, let you facet by district in your ggplot. The resulting diagram is attached at the bottom.

library(tidyverse)

df <- read.table(text="district sub_district   id  used_phone
    A   SX  1   Yes
    A   SX  2   Yes
    A   SX  3   No
    A   SX  4   No
    A   SY  4   No
    A   SY  5   Yes
    A   SZ  6   Yes
    A   SX  6   Yes
    A   SZ  7   No
    B   RX  8   No
    B   RV  9   No
    B   RX  9   No
    B   RV  10  Yes
    B   RV  11  Yes
    B   RT  12  Yes
    B   RT  13  Yes
    B   RV  13  Yes
    B   RT  14  No
    B   RX  14  No", header = T)
table(df$used_phone)
#> 
#>  No Yes 
#>   9  10

ddf <- df %>%
  select(-sub_district) %>%           # drop sub_district
  distinct(id, .keep_all = TRUE) %>%  # keep one row per unique id
  #group_by(district) %>% 
  count(used_phone) %>%               # count people per answer
  mutate(pct = n / sum(n))            # calculate share of each answer

ddf
#> # A tibble: 2 x 3
#>   used_phone     n   pct
#>   <chr>      <int> <dbl>
#> 1 No             6 0.429
#> 2 Yes            8 0.571

ggplot(ddf, aes(used_phone, pct, fill = used_phone)) +
  geom_col(position = 'dodge') + 
  #facet_wrap(~district) +
  scale_fill_manual(values = c("aquamarine4", "aquamarine3")) +
  scale_y_continuous(labels = scales::percent_format())

[image: resulting bar chart]


New addition based on the comment. The asker:
  • wants the y-axis in counts
  • wants the percentages as labels over the bars
  • wants the plot faceted by district
ddf <- df %>%
  select(-sub_district) %>%           # drop sub_district
  distinct(id, .keep_all = TRUE) %>%  # keep one row per unique id
  group_by(district) %>% 
  count(used_phone) %>%               # count people per answer
  mutate(pct = n / sum(n),            # calculate percentage
         label = paste0(round(pct*100, 2), '%'))     

ggplot(ddf, aes(used_phone, n, fill = used_phone)) +
  geom_col(position = 'dodge') + 
  facet_wrap(~district) +
  scale_fill_manual(values = c("aquamarine4", "aquamarine3")) +
  geom_text(aes(label = label),
           position = position_stack(vjust = 1.05),
           size = 3) +
  labs(y='count')

[image: resulting bar chart, faceted by district]


*new addition* change the basis for percent
ddf <- df %>%
  select(-sub_district) %>%           # drop sub_district
  distinct(id, .keep_all = TRUE) %>%  # keep one row per unique id
  mutate(ssum = n()) %>%              # total number of unique ids (before grouping)
  group_by(district) %>% 
  count(used_phone, ssum) %>%         # count people per answer
  mutate(pct = n / ssum,              # calculate percentage
         label = paste0(round(pct*100, 2), '%'))

I have introduced a new variable, ssum, which holds the total number of unique IDs computed before grouping, so every percentage uses the overall total as its denominator. That gives: [image: resulting bar chart]
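
If the percentages should instead be shares within each district, the total has to be computed after grouping, as one of the comments on this answer notes. Below is a minimal sketch of that variant, assuming within-district shares are what is wanted. The data frame is retyped from the question, with sub_district left out because the pipeline drops it anyway.

```r
library(dplyr)

# Question's data, retyped without the sub_district column
df <- data.frame(
  district   = rep(c("A", "B"), times = c(9, 10)),
  id         = c(1, 2, 3, 4, 4, 5, 6, 6, 7,
                 8, 9, 9, 10, 11, 12, 13, 13, 14, 14),
  used_phone = c("Yes", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No",
                 "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No")
)

ddf <- df %>%
  distinct(id, .keep_all = TRUE) %>%   # keep one row per unique id
  group_by(district) %>%
  mutate(ssum = n()) %>%               # unique ids within this district
  count(used_phone, ssum) %>%          # count people per answer and district
  mutate(pct = n / ssum,               # share within the district
         label = paste0(round(pct * 100, 2), '%')) %>%
  ungroup()

ddf
```

With this data each district has 7 unique IDs (3 "No", 4 "Yes"), so the labels come out as 42.86% and 57.14% in both facets. The resulting ddf can be passed to the same ggplot call as in the faceted example.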

MarBlo
  • Thanks, MarBlo. However, I would like to have my y-axis as "count" and not %. Percentages can be as labels over the bars. Could you help me do that for the facet plot for districts? – Rachita Jun 23 '20 at 08:49
  • @Rachita I made a new edit and think this is what you want. The DF has identical values for the 2 districts. – MarBlo Jun 23 '20 at 09:19
  • yes, I ran it and have got the bar labels! However, the percentages within the districts are not quite right as they are calculated with denominator = total number of observations in district A + B. I would want the denominator to be the total number of observations in the respective district. Could you tweak your code based on this? – Rachita Jun 23 '20 at 09:37
  • @Rachita I have an additional edit. Please, have look. – MarBlo Jun 23 '20 at 10:00
  • thanks so much, MarBlo!! You're a savior. However, `mutate(ssum = n()) %>% ` should come after grouping by district. I got it correct after tweaking that! – Rachita Jun 23 '20 at 10:43
  • @Rachita you are welcome. I am happy it helped. Voting up is an option – MarBlo Jun 23 '20 at 10:45
  • I have done that! However, my reputation yet doesn't allow my upvote to be showcased publicly. – Rachita Jun 23 '20 at 11:31
  • @Rachita I recognized that you changed the accepted answer from mine to monte's answer. May I ask why? – MarBlo Jun 23 '20 at 17:14
  • Sorry, I am new to this platform -- thought we could accept multiple answers. Could I request you to review another question of mine and see if could help in that? – Rachita Jun 23 '20 at 17:16
  • @Rachita SO is a very friendly and helpful platform. The people on SO helped me a lot. I am sure, if you post your question you will get help. If I have an answer to your question I will try to contribute. Your style of asking makes me believe, that you will get answers. – MarBlo Jun 23 '20 at 17:49
  • Thank you for your encouraging words! Another question of mine on which I am seeking help is here: https://stackoverflow.com/questions/62538923/splitting-multiple-date-and-time-variables-naming-those-as-per-original-variab – Rachita Jun 23 '20 at 18:11