1

I have a panel data set with id, year and numerical column A. I want to add a column B with a factor variable indicating to which quantile group each row belongs per year, with either "lowest 33%", "33-67%" or "highest 33%".

I played around with dplyrs group_by(year) and the quantile() function but this just gave me back the value of the quantile. I also found the function quantcut(), that is supposed to be doing something similar to what I need, but I have had no success.

Data:

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
                          4, 4, 4, 5, 5, 5),
                 year = c(2000, 2005, 2010, 2000, 2005, 2010,
                          2000, 2005, 2010, 2000, 2005, 2010)
                 A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
                          1000, 0, 0, 1000, 7000))


This is what I tried last which didn't give me anything useful:

library(tidyverse)
library(gtools)
df <- df %>% 
  mutate(B = quantcut(A, probs=seq(0.33, 0.67, 1)))
avocado1
  • 253
  • 2
  • 8

2 Answers2

2

I think this may solve your problem, using dplyrs recode you use findInterval and simply recode the values.

library(dplyr)

df |> 
  group_by(year) |>
  mutate(B = recode(findInterval(A, quantile(A, probs = seq(0.33, 1, 0.33))), 
`0` = "Lowest 33%", `1` = "Lowest 33%", `2` = "33%-67%", `3` = "Highest 33%", .default = "Highest 33%"))
|> ungroup()
     id  year     A B          
   <dbl> <dbl> <dbl> <chr>      
 1     1  2000     0 Lowest 33% 
 2     1  2005  3000 Highest 33%
 3     1  2010  1000 Lowest 33% 
 4     2  2000  5000 Highest 33%
 5     2  2005     0 Lowest 33% 
 6     2  2010  2000 33%-67%    
 7     3  2000  2000 33%-67%    
 8     3  2005  1000 33%-67%    
 9     3  2010     0 Lowest 33% 
10     4  2000     0 Lowest 33% 
11     4  2005  1000 33%-67%    
12     4  2010  7000 Highest 33%
  • Thank you, this does indeed seem to do the trick although I don't fully understand why. Why do you use seq(0.33, 1, 0.33) and not the seq() from @Quinten's solution? Also why the .default option and the`0` = "Lowest 33%"? – avocado1 Jul 30 '22 at 12:27
  • 1
    `recode(findInterval(.1.), .2.)` is the same as `as.character(cut(.1., labels=c(.2.)))`; since the OP asked for a factor variable and if they truly want `factor`s, then remove the `as.character`. – r2evans Jul 30 '22 at 14:22
1

quantile for the respective years using this code:

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
                        4, 4, 4),
                 year = c(2000, 2005, 2010, 2000, 2005, 2010,
                          2000, 2005, 2010, 2000, 2005, 2010),
                 A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
                       1000, 0, 0, 1000, 7000))

library(dplyr)

df %>%
  mutate(B = factor(findInterval(year, c(-Inf, quantile(year, probs=c(0.33, .67)), Inf)), 
                    labels=c("lowest 33%","33-67%","highest 33%")))
#>    id year    A           B
#> 1   1 2000    0  lowest 33%
#> 2   1 2005 3000      33-67%
#> 3   1 2010 1000 highest 33%
#> 4   2 2000 5000  lowest 33%
#> 5   2 2005    0      33-67%
#> 6   2 2010 2000 highest 33%
#> 7   3 2000 2000  lowest 33%
#> 8   3 2005 1000      33-67%
#> 9   3 2010    0 highest 33%
#> 10  4 2000    0  lowest 33%
#> 11  4 2005 1000      33-67%
#> 12  4 2010 7000 highest 33%

Created on 2022-07-30 by the reprex package (v2.0.1)

You could use the function findInterval where you specify the labels per quantile like this:

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
                        4, 4, 4),
                 year = c(2000, 2005, 2010, 2000, 2005, 2010,
                          2000, 2005, 2010, 2000, 2005, 2010),
                 A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
                       1000, 0, 0, 1000, 7000))

library(dplyr)
df %>%
  group_by(id) %>%
  mutate(B = factor(findInterval(A, c(-Inf, quantile(A, probs=c(0.33, .67)), Inf)), 
                    labels=c("lowest 33%","33-67%","highest 33%")))
#> # A tibble: 12 × 4
#> # Groups:   id [4]
#>       id  year     A B          
#>    <dbl> <dbl> <dbl> <fct>      
#>  1     1  2000     0 lowest 33% 
#>  2     1  2005  3000 highest 33%
#>  3     1  2010  1000 33-67%     
#>  4     2  2000  5000 highest 33%
#>  5     2  2005     0 lowest 33% 
#>  6     2  2010  2000 33-67%     
#>  7     3  2000  2000 highest 33%
#>  8     3  2005  1000 33-67%     
#>  9     3  2010     0 lowest 33% 
#> 10     4  2000     0 lowest 33% 
#> 11     4  2005  1000 33-67%     
#> 12     4  2010  7000 highest 33%

Created on 2022-07-30 by the reprex package (v2.0.1)

Quinten
  • 35,235
  • 5
  • 20
  • 53
  • Thank you, this seems more straight forward than I thought it would be. But I'm not sure this solution calculates the quantiles for each year separately? Because what I'm interested in is the relative position of each id per year to account for shifts in the distribution over time. It seems like this gives me the relative position within each id across time. Can I just replace the group_by(id) with group_by(year)? – avocado1 Jul 30 '22 at 11:08
  • Hi @avocado1, I added some code. Is this what you mean? – Quinten Jul 30 '22 at 11:16
  • No sorry. I will try to explain it differently. My original data frame is a 3 year panel in long format with 6000+ ids, so 18000+ rows. Basically for each year I have to calculate the percentiles of A based on that specific years distribution of A. Think of three separate data frames per year and for each of those frames I need the ids position in the distribution of A. I don't know how to explain it better but I think exchanging group_by(id) with group_by(year) in your first solution comes closer to what I need. – avocado1 Jul 30 '22 at 11:35