
I have a data set as shown below:

df <- tribble(
  ~id,  ~price, ~type, ~number_of_book,        
  "1",    10,     "X",        3,    
  "1",     2,     "X",        1, 
  "1",     5,     "Y",        1,         
  "2",     7,     "X",        4,
  "2",     6,     "X",        1,
  "2",     6,     "Y",        2, 
  "3",     2,     "X",        4,
  "3",     8,     "X",        2,
  "3",     1,     "Y",        4,
  "3",     9,     "Y",        5
)

Now, I want to answer this question: for each id and each price group, what percentage of the books is of type X and what percentage is of type Y? In other words, what is the distribution of book types for each id and price group?

To do this, I first need to build the intermediate data set I visualize in my mind:

agg_df <- tribble(
  ~type,     ~id,       ~less_than_two,  ~`three-five`,  ~`five-six`, ~more_than_six,
    "X",      "1",              1,               0,           0,            3,
    "Y",      "1",              0,               1,           0,            0,
    "X",      "2",              0,               0,           1,            4,
    "Y",      "2",              0,               0,           2,            2,
    "X",      "3",              4,               0,           0,            2,
    "Y",      "3",              4,               0,           0,            5
)

And then, this will be my desired data set:

desired_df <- tribble(
  ~type,     ~id,       ~less_than_two,  ~`three-five`,  ~`five-six`, ~more_than_six,
  "X",      "1",            "100%",           "0%",          "0%",       "100%",
  "Y",      "1",              "0%",         "100%",          "0%",         "0%",
  "X",      "2",              "0%",           "0%",       "33.3%",      "66.7%",
  "Y",      "2",              "0%",           "0%",       "66.7%",      "33.3%",
  "X",      "3",             "50%",           "0%",          "0%",      "28.6%",
  "Y",      "3",             "50%",           "0%",          "0%",      "71.4%"
)

This desired data set shows that when id is "3" and the price bin is more than six dollars, there are two books of type X but five books of type Y. So the distribution is X (28.6%) and Y (71.4%).
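The arithmetic for that bin can be checked directly in base R (a sketch; the counts 2 and 5 come from the table above):

```r
# Books for id "3" priced above six dollars: 2 of type X, 5 of type Y
x_books <- 2
y_books <- 5
total   <- x_books + y_books      # 7 books in this (id, price-bin) cell
round(100 * x_books / total, 1)   # 28.6
round(100 * y_books / total, 1)   # 71.4
```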

Note: I asked a similar question here, but this manipulation is more complex and I could not manage it: How to manipulate (aggregate) the data in R?

I would appreciate if you could help me. Thanks in advance.

datazang
  • Sorry, I've just corrected the data. – datazang Dec 08 '19 at 23:48
  • @akrun unfortunately, the code did not give the correct answer. – datazang Dec 08 '19 at 23:52
  • @akrun I've tried both price and the number of book in your code, but it still does not give the result. :/ – datazang Dec 08 '19 at 23:58
  • 1
    Can you check if this helps ? `df %>% mutate(price_group = c("less_than_two", "three_five", "five_six", "more_than_six")[findInterval(price, c(2, 5, 6), left.open = TRUE) + 1]) %>% group_by(id, type, price_group) %>% summarise(number_of_book = sum(number_of_book)) %>% group_by(id, price_group) %>% mutate(n = number_of_book/sum(number_of_book) * 100) %>% select(-number_of_book) %>% pivot_wider(names_from = price_group, values_from = n)` – Ronak Shah Dec 09 '19 at 01:25
  • Note that the `findInterval` was already mentioned in the comments – akrun Dec 09 '19 at 03:05
  • @akrun I also don't know why but it always adds "price" as a missing grouping variable to my real data in your code. – datazang Dec 10 '19 at 16:54
  • sorry, I can't reproduce the issue – akrun Dec 10 '19 at 16:55

3 Answers


We can create a bin group by applying cut to the 'price' column, group by 'id' and 'grp', compute the percentage by dividing 'number_of_book' by the group sum of 'number_of_book', and reshape into 'wide' format.

library(dplyr)
library(tidyr)
df %>% 
  group_by(id,grp = cut(price, breaks = c(-Inf, 2, 5, 6, Inf), 
    c('less_than_two', 'three-five', 'five-six', 'more_than_six')), add = TRUE) %>%
  mutate(Perc = 100 *number_of_book/sum(number_of_book)) %>%
  select(-price, -number_of_book) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = grp, values_from = Perc, values_fill = list(Perc = 0)) %>%
  select(-rn)
# A tibble: 6 x 6
# Groups:   id [3]
#  id    type  more_than_six less_than_two `three-five` `five-six`
#  <chr> <chr>         <dbl>         <dbl>        <dbl>      <dbl>
#1 1     X             100             100            0        0  
#2 1     Y               0               0          100        0  
#3 2     X             100               0            0       33.3
#4 2     Y               0               0            0       66.7
#5 3     X              28.6            50            0        0  
#6 3     Y              71.4            50            0        0  
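The result above is numeric, while the desired_df in the question holds strings such as "33.3%". If the quoted form is really needed, one option (a sketch, not part of the answer above; `pct` is a hypothetical helper) is sprintf:

```r
# Hypothetical helper: render a numeric percentage as a "%" string
pct <- function(x) sprintf("%.1f%%", x)

pct(100 * 1 / 3)   # "33.3%"
pct(100 * 2 / 7)   # "28.6%"
```

Applied to the wide result, e.g. with `mutate_if(is.numeric, pct)` (the scoped-verb idiom current in dplyr at the time).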
akrun
  • Oops, in my real data set it still does not give the accurate result. I don't know why, but after I run the code the price column remains in the data. – datazang Dec 09 '19 at 00:27
  • @zineda If you check the line `select(-price, -number_of_book) %>%`, it is removing the `price` column. Maybe you have included `price` as a grouping variable? It is also possible that you loaded another package with a `select` function; in that case, use `dplyr::select(-price, -number_of_book)` specifically. – akrun Dec 09 '19 at 00:28
  • @zineda Please check the `str(df)` with `str(originaldata)` – akrun Dec 09 '19 at 00:31
  • "id" is integer, "price" is numeric, "type" is character, "number_of_book" is integer. Even if I add the `dplyr::`, I am getting this in the console: Adding missing grouping variables: `price` – datazang Dec 09 '19 at 00:38
  • @zineda Not clear though as I am not able to reproduce it – akrun Dec 09 '19 at 03:00

We can divide price into groups using findInterval, compute the sum of number_of_book for each id, type and price_group, and then calculate its ratio within each id and price_group. Finally, we get the data in wide format using pivot_wider.

library(dplyr)

df %>% 
  mutate(price_group = c("less_than_two", "three_five", "five_six", "more_than_six")
                      [findInterval(price, c(2, 5, 6), left.open = TRUE) + 1]) %>% 
  group_by(id, type, price_group) %>%
  summarise(number_of_book = sum(number_of_book)) %>% 
  group_by(id, price_group) %>% 
  mutate(n = number_of_book/sum(number_of_book) * 100) %>% 
  select(-number_of_book) %>% 
  tidyr::pivot_wider(names_from = price_group, values_from = n, 
                     values_fill = list(n = 0))

#  id    type  less_than_two more_than_six three_five five_six
#  <chr> <chr>         <dbl>         <dbl>      <dbl>    <dbl>
#1 1     X               100         100            0      0  
#2 1     Y                 0           0          100      0  
#3 2     X                 0         100            0     33.3
#4 2     Y                 0           0            0     66.7
#5 3     X                50          28.6          0      0  
#6 3     Y                50          71.4          0      0  
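To see how the findInterval call assigns prices to the four bins, here is a small base-R check (labels as in the answer; no packages required):

```r
labels <- c("less_than_two", "three_five", "five_six", "more_than_six")
# left.open = TRUE makes each interval (a, b], so a price of exactly 2
# still counts as "less_than_two" and a price of exactly 5 as "three_five"
idx <- findInterval(c(2, 5, 6, 7, 10), c(2, 5, 6), left.open = TRUE) + 1
labels[idx]
# "less_than_two" "three_five" "five_six" "more_than_six" "more_than_six"
```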
Ronak Shah

Maybe not the perfect solution, but another way is to use case_when to define the different categories:

library(tidyverse)
df %>% 
  group_by(id, type, price) %>% 
  mutate(Less2      = case_when(price <= 2 ~ cumsum(number_of_book)),
         Three_Five = case_when(price %in% 3:5 ~ cumsum(number_of_book)),
         Five_six   = case_when(price > 5 & price <= 6 ~ cumsum(number_of_book)),
         More_six   = case_when(price > 6 ~ cumsum(number_of_book))) %>% 
  replace(is.na(.), 0) %>% 
  ungroup() %>% 
  group_by(id, type) %>% 
  summarise_at(vars(Less2:More_six), ~ sum(.)) %>%
  ungroup() %>%
  group_by(id) %>%
  mutate_at(vars(Less2:More_six), ~ replace_na(100 * . / sum(.), 0)) 

# A tibble: 6 x 6
# Groups:   id [3]
  id    type  Less2 Three_Five Five_six More_six
  <chr> <chr> <dbl>      <dbl>    <dbl>    <dbl>
1 1     X       100          0      0      100  
2 1     Y         0        100      0        0  
3 2     X         0          0     33.3    100  
4 2     Y         0          0     66.7      0  
5 3     X        50          0      0       28.6
6 3     Y        50          0      0       71.4
dc37
  • I would use `findInterval`. Regarding your last `mutate`, can it be shortened with `mutate_at`, i.e. `mutate_at(vars(Less2:More_six), ~ replace_na(./sum(.), 0))`? – akrun Dec 09 '19 at 00:17
  • Thanks for the update with `mutate_at`, it makes things much nicer now. I did not know about `findInterval`, I will take a look ;) – dc37 Dec 09 '19 at 00:44
  • Similarly, you could also use `summarise_at` for this line, `summarise(Less2 = sum(Less2).....` – Ronak Shah Dec 09 '19 at 00:49
  • I've tried to code this, but it takes ages to run my real data set. :/ – datazang Dec 09 '19 at 01:01
  • Thanks @RonakShah for the tip. I'm editing my answer – dc37 Dec 09 '19 at 05:38
  • Sorry that my code is too slow. Maybe the answer from Ronak or akrun is better for your case. How big is your data frame? A solution involving data.table could be more efficient if it is very large. – dc37 Dec 09 '19 at 05:45