How to manipulate (aggregate) the data in R?

Question

I have a data set as I've shown below:

df <- tribble(
  ~id,  ~price, ~number_of_book,        
  "1",    10,         3,        
  "1",     5,         1,         
  "2",     7,         4,
  "2",     6,         2, 
  "2",     3,         4,
  "3",     4,         1,
  "4",     5,         1,
  "4",     6,         1,
  "5",     1,         2,
  "5",     9,         3,
)

As you see in the data set, there are 3 books which cost 10 dollar for each book if id is "1" and 1 book that costs 5 dollar. Basically, I want to see the share (%) the number of books for each price bin. Here is my desired data set:

df <- tribble(
  ~id,    ~less_than_three,   ~three-five,  ~five-six, ~more_than_six,     
  "1",          "0%",              "25%",     "0%",         "75%",
  "2",          "0%",              "40%",     "20%",        "40%",
  "3",          "0%",              "100%",    "0%",         "0%",  
  "4",          "0%",              "50%",     "50%",        "0%",
  "5",          "40%",             "0%",      "0%",         "60%",
)

Now, I clustered the prices first. To do this, I run the below code:

out <- cut(df$price, breaks = c(0, 3, 5, 6, 10),
           labels = c("<3","3-5","5-6", ">6")) 

out = table(out) / sum(table(out))

But unfortunately, I could not go further because of lack of coding knowledge. Would you help me to get the desired data?

score 3 · Answer 1 · answered Dec 03 '19 at 19:47

3

We can use cut to get the intervals and then using tidyr transform data to wide format and at the end using janitor add the percentages.

library(dplyr)
library(tidyr)
library(janitor)

df %>% 
  mutate(interval = cut(price, c(0,3,5,6,Inf))) %>% 
  select(-price) %>% 
  pivot_wider(names_from = interval, values_from = number_of_book) %>% 
  adorn_percentages()

#>  id (6,Inf] (3,5] (5,6] (0,3]
#>   1    0.75  0.25    NA    NA
#>   2    0.40    NA   0.2   0.4
#>   3      NA  1.00    NA    NA
#>   4      NA  0.50   0.5    NA
#>   5    0.60    NA    NA   0.4

answered Dec 03 '19 at 19:47

M--

25,431
8
61
93

You could use `values_fill = list(percentage_of_book = 0)` within `pivot_wider` to avoid NAs – akraf Dec 03 '19 at 19:55
Note that this assumes only 1 row per (interval, id) group. Try with an additional row between rows 1 and 2 with price = 4 to see what I mean. – IceCreamToucan Dec 03 '19 at 19:56
@akraf-ReinstateMonica I am aware of that, but am not fond of replacing NAs with zeros as they carry different meanings. – M-- Dec 03 '19 at 19:56
@IceCreamToucan an easy ```group_by %>% summarise``` would take care of that, as you demonstrated in your answer. Cheers +1. – M-- Dec 03 '19 at 19:59
Thank you for your help @M--, I am getting the following error in my real data set: Error in as_tabyl(dat) : at least one one of columns 2:n must be of class numeric In addition: Warning message: Values in `number_of_book` are not uniquely identified; output will contain list-cols. – datazang Dec 03 '19 at 20:02
@zineda Would you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – M-- Dec 03 '19 at 20:03
That is because when you have > 1 row per (interval, id) group (In the language of the warning, "Values in number_of_book are not uniquely identified"), the output contains list cols, and adorn_percentages cannot work with list cols. This is the issue I mentioned earlier. – IceCreamToucan Dec 03 '19 at 20:06
@M--, that was already a small sample of my real data set. "id" and "number_of_books" are integer, price is numeric if it helps you? – datazang Dec 03 '19 at 20:10
@IceCreamToucan then how to fix this? – datazang Dec 03 '19 at 20:11

IceCreamToucan · Accepted Answer · 2019-12-03T19:59:44.397

With dplyr, you can add a column cols which will be used for the column names. Then you can sum the number of books for each col in each id. Next you can compute the percent by dividing these numbers by the sum for that id, then applying scales::percent for formatting as a percent rather than decimal. Now you just need to pivot_wider giving the variables from which to get the names and values, and reorder the columns to match the original label order. (This is a little more involved than the other answer since it accounts for the case when there is >1 row for a given (id, cols/interval) pair, and janitor simplifies things)

labels = c("less_than_three","three_to_five","five_to_six", "more_than_six")

df %>% 
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>% 
  summarise(n = sum(number_of_book)) %>% 
  group_by(id) %>% 
  mutate(pct = scales::percent(n/sum(n), 1)) %>% 
  pivot_wider(id_cols = id, names_from = cols, values_from = pct) %>% 
  select_at(c('id', labels)) %>% 
  ungroup

# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>        
# 1 1     NA              25%           NA          75%          
# 2 2     40%             NA            20%         40%          
# 3 3     NA              100%          NA          NA           
# 4 4     NA              50%           50%         NA           
# 5 5     40%             NA            NA          60%

If you want to replace the NAs with 0% (which I think makes sense in this context, and matches the output shown in the question), you can use the method mentioned in the comment below.

df %>% 
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>% 
  summarise(n = sum(number_of_book)) %>% 
  group_by(id) %>% 
  mutate(pct = scales::percent(n/sum(n), 1)) %>% 
  pivot_wider(id_cols = id, names_from = cols, values_from = pct,
              values_fill = list(pct = '0%')) %>% 
  select_at(c('id', labels)) %>% 
  ungroup

# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>        
# 1 1     0%              57%           0%          43%          
# 2 2     40%             0%            20%         40%          
# 3 3     0%              100%          0%          0%           
# 4 4     0%              50%           50%         0%           
# 5 5     40%             0%            0%          60%

You could use `values_fill = list(percentage_of_book = 0)` within `pivot_wider` to avoid NAs — akraf, Dec 03 '19 at 19:55

How to manipulate (aggregate) the data in R?

2 Answers2

Linked